Learning Machines 
A Survey by GORDON PASK 


Acknowledgements 


I would like to thank Mr. F. Coales, Dr. A. M. Uttley and 
Professor J. H. Westcott for their helpful criticism, and Mr. J. D. 
Cowan and Mr. B. N. Lewis for reading and commenting on the 
manuscript. My own work in this field has been mainly supported 
by the Air Force Office of Scientific Research, under Contract 
AF 61 (052)-640 through the European Office of Aerospace Re- 
search (OAR), United States Air Force. 


Summary 


Learning machines have increasing practical importance and it is 
necessary to develop a competent theory for these systems rather than 
treating them as though they were simply an intractable kind of adap- 
tive controller. Since learning is a goal directed change in a pattern 
of behaviour, it is a property that depends upon the environment and 
the objective of the machine, as well as the specification of the machine 
itself. Briefly, it is a property of an organization and, consequently, 
cybernetics offers the most suitable framework for a theory of learning 
machines. 

Present-day learning machines, which may be large and elaborate 
networks of adaptive components or the equivalent computer pro- 
grammes, stem from the pioneer work upon conditional probability 
machines and homeostatic systems. The mechanical concomitant of 
learning always involves structural modification, over and above adap- 
tive change in the system parameters, but the modifying process, al- 
though it may involve ‘random’ variation, is well specified in any 
realistic device (purely ‘random’ search is impracticable). Further, a 
useful learning machine has an hierarchical organization. If it is part 
of a control system there will be different levels of goal associated with 
different levels of control. In any case there will be more or less abstract 
representations of the environment. When the machine is required to 
learn about a symbolic or linguistic environment (in problem solving 
or in man-machine symbiosis) these levels correspond to metalanguages 
in terms of which internal data processing is carried out and external 
communication can readily take place. An evolutionary learning 
machine is able to build its own hierarchical structure and in some 
conditions this process may be identified with a form of ‘concept’ 
acquisition. In order to describe the underlying mechanism, it is
necessary to introduce the idea of ‘reproducing’ or unlocalized automata.

Where design principles are concerned, brains and neural structures 
still provide the most useful data, though there is no necessary relation 
between the structural or functional specification of a brain and of a 
learning machine. Certainly there are features of development, like the 
need for maturation in contact with the environment, that are common
to each system. On the assumption that the biological analogy contains 
a fair grain of truth, the realization of tangible learning machines will 
call for large, flexible and computationally ‘parallel’ assemblies of
coupled transmission lines. It has been proposed that these active 
transmission lines would be optimally embodied in macromolecular 
structures and this suggestion appears to be physically plausible. 


1. Introductory Comments 


1.1. Importance of Learning Machines 


Learning machines are more than the ingenious toys they are 
commonly supposed to be. They constitute a large and important
class of automata. Members of this class that have been realized
as artefacts (mechanical or electronic models) provide existence 
proofs, often the only existence proofs available at the moment, 
for propositions in the theory of learning machines. The theory 
of automatic control and finite automata (devices such as switch- 
ing networks that compute a function of a well defined input) is 
a beautifully constructed but highly specialized fragment of the 
theory of learning machines. 

Beyond a certain limit, the fragment of a theory cannot be 
usefully extrapolated. Conceptually plausible analogies with 
familiar constructions break down and it becomes more con- 
venient to tackle the problems that arise in terms of a framework 
that embodies the original and well-known area of study as a 
special part. My contention is that this critical stage has been 
reached in connection with many problems of automatic control. 
Attempts to describe essentially learning situations in terms of the 
conventional formulations are often strained, or inadequate, or 
beset with illegitimate approximations. Whether he likes it or 
not, the control engineer is being forced to accept the need for 
a theory of learning machines and not infrequently he is also 
required to develop a theory where none exists. 

In this case, learning machines, however crude they may be, 
provide valuable indicators (some of them suggest solutions to 
particular problems; for example, in pattern recognition), but 
the mere existence of learning machines is insufficient. The pre- 
sent theoretical background (which is firm but far from elegant) 
needs rapid development in order to serve a new cybernetic 
technology of control and computation in systems previously 
regarded as intractable. 

To restrict the discussion, obviously trivial learning machines 
(tape recorders with cunningly manipulated inputs and outputs) 
will be excluded even though they are presented in a plausible 
guise. Next, non-trivial learning machines will be excluded if 
they are more opportunely described in some other fashion 
(adaptive systems like Gabor’s¹ genuinely ‘learn’ about a
restricted environment but their activity can be analysed within
the conventional theory of control and ‘learning’ need not be
considered). One can omit the field of finite automata, in which
the work of McCulloch and Pitts² on neural network analogues
is pertinent to brains, but oriented towards computation rather 
than learning. Finally, for lack of space rather than relevance, 
a detailed examination of the ‘learning algorithms’ that feature 
in ‘artificial intelligence’ is avoided although it will be neces- 
sary to view the field rather broadly in 2.8. There are still, 
to my knowledge alone, over 350 significantly different learning 
machines which have either been simulated by computer pro- 
grammes or constructed as special mechanisms. 


1.2. Deterministic and Probabilistic Systems 


A system, whether it is an animal or a machine, is always 
specified in relation to a particular method of observation. In 
the case of animals this specification usually amounts to the
experimental conditions, the environment in which the animal is
placed (a rat in a maze) and the measurements it is proposed to
make (whether the rat turns left or right). In the case of a ma- 
chine the specification is often dictated by the circuitry and, in 
so far as control mechanisms are concerned, the environment to 
be controlled. At any rate, whenever a system is mentioned there 
is also, by definition, a state description that lists the relevant 
states, which, in the ultimate analysis, are states which, for often 
excellent reasons, we, the experimenters, deem relevant. 


A behaviour is a sequence of output states (or actions) that 
are contingent upon input states or evidence and upon the inter- 
nal state of the system. If a behaviour is repeatable and consistent 
it can be ascribed to a particular computation that the system 
performs (in other words, in view of the observed regularity, the 
system can be described as though it had been built to compute 
a given function of its input). Of course, if the system is a 
machine that has been built for some purpose, it will be possible 
to infer this from an inspection of its circuit. Statistical invariants 
are countenanced as regularities and if these exist, even in the 
absence of any deterministic pattern of behaviour, the system is 
said to compute a probabilistic function of its input. 


Because there is partial ignorance of the internal state and the
exhaustive input specification of men and animals, their
experimental gambits are only defined at the probabilistic level. For
constructed machines, with the exception of some evolutionary 
systems introduced in 2.7, a deterministic specification is pos- 
sible. It may, however, be completely impracticable, because the 
system is very large (networks in 2.4), because there is a 
‘redundancy of computation’ so that the same easily recognized 
behaviour may be due to the action of different mechanisms (as 
in many stable ‘error correcting’ computers), or because internal 
parameters of the system are changed as a function of its previous 
states in such a way that it proves impossible to assume a mo- 
mentarily static structure. 


Adaptive machines with a ‘black box’ for providing an
uncorrelated forcing input that modifies their parameter values (for
example, the adaptive controller in Figures 6 or 7 and some of
the machines described in 2.3 and 2.4) are strictly deterministic
if the interior of the ‘black box’ is examined, but it is useful to
describe them as though they were probabilistic devices if the
sequence of values of the forcing input is irrelevant (as it is when
the machine is intended to control a statistically defined
environment). Nor are ‘random’ learning networks (like those described
in 2.4) really probabilistic, because their initial connectivity is 
determined by a particular ‘random’ number table. In this case, 
however, a very different probabilistic fiction is sometimes use- 
ful. The ‘random’ number table is chosen to have certain statis- 
tical constraints and the particular network is regarded as a 
‘random’ sample from a statistical ensemble characterized by 
these constraints. Whereas the sequence of input events was 
previously deemed irrelevant to the adaptation of the given 
machine, in this case the variations of the machine within the 
statistical ensemble are deemed irrelevant. 


1.3. Minimal Criterion for Learning


A minimal criterion for learning is a goal directed change in 
computation or possibly ‘a novel computation’. By analogy, 
when one says that a man ‘X’ has learned a skill or learned a 
telephone number one does not merely mean that some change 
has occurred in his brain (it certainly has, of course, and some 
kind of adaptation is a prerequisite for learning). One also means
that the man has acquired the computing process involved in
performing the skill or reciting the telephone number. 

Further, since learning is a property of a system’s behaviour, 
an immediate inference from 1.2 is that it will depend upon
the experimental conditions (a given machine will be a learning
machine in some circumstances but not in others). The man
‘X’, for example, can only be said to have learned if he is present- 
ed with an environment in which he can recall his skill or his 
telephone number. 

Finally the property of learning is unaltered in an isomorphic 
model or representation of the system since it does not directly 
depend upon material construction. Nobody argues that men 
learn because they are made of protein or that computers cannot 
learn because they are made from metal. Rather, ‘learning’ be- 
longs to an organization some part of which may be common to 
men and computing machines. To summarize these points one 
can say that ‘learning’ is a cybernetic construct and is capable 
of embodiment in cybernetic models that represent organizations. 
The condition that learning is goal directed implies that an ob- 
server or model builder can understand the change in computa- 
tion that takes place as a result of learning and restricts the dis- 
cussion to non-trivial cybernetic models. 

In fact, the most interesting kinds of learning involve more 
specialized changes than have been described. A system may 
learn to abstract features of its environment (in other words, to 
represent a class of events in terms of attributes of this class). 
It may associate equivalence classes of its internal states (called 
‘signs’) with subsets of points in a descriptive space which has 
these features or attributes as its coordinates, when the ‘sign’ 
may be called a symbol and the subset of descriptive points, its 
denotation. The system may combine symbols or form classes 
of invariant relations between symbols (which, in the logician’s 
sense, can serve as universals). Finally, one might (in some care- 
fully selected circumstances) refer to the process whereby inter- 
nal or external events are signified and manipulated as ‘percepts’ 
and ‘concepts’. 

A great deal of machine learning does involve processes that 
appear to be isomorphic with ‘concept acquisition’ (using this 
phrase like those objectively oriented psychologists who aim for 
a functional account of mentation). Whether a given observer is 
prepared to say that the machine concerned does acquire con- 
cepts, or, indeed, whether he is prepared to say it learns (rather 
than ‘learns’) is a matter, as MacKay³ points out, for that
observer to decide. The decision has very little to do with the
manifest behaviour and chiefly depends upon his personal view
about how much he resembles the machine⁴. All a cybernetician
can assert is that for some systems there is no mechanical ab- 
surdity involved in the contention that the device is able to 
learn or conceptualize. So far as the minimal criterion is con- 
cerned, at any rate, there is a tacit assumption that a ‘learning’ 
machine is literally accepted as a learning machine. 


1.4. Machines that Imitate Organisms


There are two broad categories of learning machine:


(1) Devices that imitate the undeniable learning behaviour of 
an organism. 


(2) Devices that manifest the abstract cybernetic property of 
learning. Frequently, the apparatus must also satisfy a practical 
goal like learning to recognize distorted alphabetical characters 
or learning to control a non-stationary process.

In either case, a simulation (a tangibly realized cybernetic
model) must embody salient features of the environment as well 
as the learning system itself. 

Members of category (2) are fairly well known and are con- 
sidered at length in 2.2. For the moment let us examine a few 
members of category (1), which are less familiar in the discipline 
of engineering. 

The learning machines of (1) simulate animals or brains or 
parts of brains. Typical small animal simulators are Deutsch’s
rat⁵ (which ‘learns’ to run a maze according to the tenets of
Deutsch’s learning theory) and Angyan’s turtle⁶ (which ‘learns’
about a more liberally specified environment, using the strategies 
of sequential conditioning). In each case, the behaviour of a real 
animal is abstracted and the inbuilt mechanism constitutes a 
psychological or neurophysiological hypothesis about the gener- 
ation of learning behaviour which is tested and explicated by 
embodiment in an artefact. From the viewpoint of control 
engineering, their chief value stems from the argument given 
in 3.1, namely that although there is no necessary connection 
between learning machines and brains or animals, the most 
potent clues about the design of learning machines come from 
biology. 

The first of these models was constructed by Grey Walter⁷
in the form of a mechanical ‘tortoise’ with auditory, visual and 
tactile sensory inputs and motor capabilities that drove it around 
the floor with a behaviour initially determined by inbuilt light 
responsive reflexes. Grey Walter used this machine to demon- 
strate the acquisition of conditioned reflexes. In particular, he 
aimed to show the consequences of a four stage learning hypo- 
thesis. 

(a) Signals representing unconditioned stimuli (in other 
words, stimuli that already give rise to some response, due to 
built-in association) are abrupt and demark the onset of a stimu- 
lus, whereas signals representing novel stimuli (which may be- 
come conditioned stimuli) are ‘stretched’. 

(b) The overlap between unconditioned and (already ‘stretched’)
novel stimuli is counted as a measure of the regularity of
this joint event.

(c) If the overlap count exceeds some threshold value an 
autonomous process, such as a damped long term oscillation, 
is initiated. Very tentatively, this event corresponds to an ‘in- 
sight’. 

(d) If the novel stimulus is combined with the existence of the 
autonomous oscillatory process that indicates frequent coin- 
cidence between the unconditioned and the novel stimuli, the 
novel stimulus becomes coupled to and will elicit the existing un- 
conditioned response. Hence, the novel stimulus becomes a con- 
ditioned stimulus. 
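
The four stages (a) to (d) may be illustrated by a minimal discrete-time sketch. It is not a description of Grey Walter's circuit: the class name, the stretch length, the overlap threshold, the decay rate and the cut-off level are all hypothetical values chosen for illustration, and the damped oscillation of stage (c) is reduced here to a simple decaying amplitude.

```python
class ConditionedReflex:
    """Toy model of stages (a)-(d); every constant is an assumption."""

    def __init__(self, stretch=3, threshold=4, decay=0.9, floor=0.2):
        self.stretch = stretch      # steps for which a novel signal is prolonged (a)
        self.threshold = threshold  # overlap count that starts the process (c)
        self.decay = decay          # damping of the autonomous process (c)
        self.floor = floor          # amplitude below which the process has died out
        self.stretched = 0          # remaining steps of the stretched novel signal
        self.overlap = 0            # running count of coincidences (b)
        self.memory = 0.0           # amplitude of the autonomous process (c)

    def step(self, unconditioned, novel):
        """Advance one time step; return True if the response is elicited."""
        if novel:
            self.stretched = self.stretch        # (a) stretch the novel signal
        if unconditioned and self.stretched > 0:
            self.overlap += 1                    # (b) count the joint event
        if self.overlap >= self.threshold:
            self.memory = 1.0                    # (c) start the damped process
            self.overlap = 0
        # (d) the novel stimulus alone elicits the response while the process lasts
        conditioned = novel and self.memory > self.floor
        response = unconditioned or conditioned
        self.memory *= self.decay
        self.stretched = max(0, self.stretched - 1)
        return response


machine = ConditionedReflex()
before = machine.step(False, True)      # novel stimulus alone, before training
for _ in range(4):                      # four paired presentations
    machine.step(False, True)
    machine.step(True, False)
after = machine.step(False, True)       # novel stimulus alone, after training
print(before, after)
```

In this sketch the novel stimulus elicits nothing before the paired presentations and elicits the response afterwards; if no further pairing occurs, the decaying amplitude eventually falls below the cut-off and the association lapses.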

Learning machines that simulate the activity of a brain differ 
according to the degree of physiological verisimilitude and the 
level of abstraction involved in their design. At one extreme 
there are models, like Harmon’s artificial neurones⁸, that
accurately replicate the characteristics of a real neurone (in so far
as these are known from microelectrode recordings). Small col- 
lections of these elements (corresponding, at most, to ganglionic 
organizations) exhibit versatile behaviour patterns, some of 
which have properties akin to learning. At the other extreme 
there are statistical models, such as that of Beurle⁹,¹⁰, in which
a large collection of artificial neurones, characterizing a more
abstract image of the real neurone than those designed by
Harmon, are connected to form a network that replicates the
statistically defined connectivity of the mammalian cerebral cortex. If
one of Beurle’s self-oscillatory models is coupled to a suitable 
environment it, also, manifests learning behaviour. 

At this level, the systems of (1) and (2) above are distin- 
guished by intention alone and most of them are open to a dual 
interpretation. Thus one of the conditional probability machines 
of Uttley¹¹⁻¹⁵, which are described in detail in 2.3, can be
arranged to imitate a brain very closely, but its value does not 
depend upon imitation. A conditional probability machine is 
interesting in its own right. 


1.5. The Material and Symbolic Dependence of a Realizable and 
Effective Learning Machine 


Although the fabric of a learning machine, synthesized on 
the basis of a cybernetic model, is irrelevant to learning (so that 
there is no need to distinguish between a computer programme 
and a special purpose device) it is not true that naturally occur- 
ring machines (animals and other organisms) are independent of 
their material character. The natural and the synthetic machines 
are both said to learn because their computation changes, but 
an animal changes its behaviour in order to survive in a change- 
ful environment. Hence it learns because it, or its species, ‘must 
learn’. Its ‘goal’ entails its existence. Further, its ‘goal’ is fashion- 
ed by the material used in its construction. In a constructed arte- 
fact the goal is introduced as part of the cybernetic model. 

Now if the learning machine is supposed to imitate an
animal, the goal will be an abstraction of certain energetic or
material constraints that act upon the real animal. Ashby¹⁶ represents
these constraints as a set of ‘essential variables’ (imaging, for 
example, the temperature and blood sugar concentration in a 
real creature) whose values must, in a cybernetic model, be main- 
tained within prescribed limits. On the other hand, if the model 
simulates the property of learning (without any biological over- 
tones) the goal is somewhat arbitrary and in practically applicable 
models it is often determined by a desire to optimize process 
parameters or to recognize specific figures. The important point 
is that we do not have the liberty in choice of a goal that a
cybernetic model suggests. No realizable machine can neglect the
constraints and continuities of a material system.

The point is self-evident so far as simple adaptive controllers 
of the kind considered in 2.2 are concerned. No sane man de- 
signs a completely hapless servomechanism, but in this simple 
case, the machines inhabit a well-defined domain. Their goal, 
amongst other things, is always obvious. The physics of the situ- 
ation imposes a partitioning upon the environment that deter- 
mines a reasonable path whereby the machine can achieve one 
after another in an hierarchy of subgoals and ultimately satisfy 
the main objective. To avoid any confusion, let us reaffirm that 
the information conveyed by a message, coded as a signal 
sequence, is substantially independent of energy in a macroscopic 
system. The wise designer, however, gives up the freedom he 
might exercise and specifies an interaction between the machine 
and its environment wherein the form of message adheres to 
the form of energetic or material constraints. 

Many of the most useful applications of learning machines 
(in pattern recognition, the control of really large industrial
processes, in management, mechanical translation, man-machine
symbiosis and in library retrieval) refer to a symbolic or
linguistic universe of discourse which constitutes an almost
uncharted environment wherein the designer has far greater
freedom than either nature or a prudent engineer. Even in the most
favourable cases where the goal is well defined (as it is in pattern
recognition and is not in management) there is no guarantee 
that algorithms exist for achieving this goal or that partitions 
exist in the set of possibilities. 

These issues are, of course, interdependent and Aizerman¹⁷,
in his paper, cites a case where a character recognition machine
can be designed, with reasonable goal approaching algorithms, 
because the symbolic environment satisfies a ‘compactness hypo- 
thesis’ which implies the possibility of partitioning (Aizerman’s 
‘compactness hypothesis’ reflects a cultural constraint analogous 
to the energetic constraints of the servomechanism designer. 
Other principles encountered in this field reflect linguistic, psy- 
chological and logical constraints). 

It can be shown that learning machines are able to perform 
abstractions and manipulate some kind of ‘concept’¹⁸. It is argued
in 2.4 and 2.5 that they can build non-trivial descriptive hier- 
archies, associated with ‘conceptual’ levels and directly related 
to the subgoals that are searched for in terms of a given level of 
abstraction. All this is a prerequisite for successful communica- 
tion with a symbolic environment but it is not enough to ensure 
effective control. In a given symbolic environment only some 
‘concepts’ can be generalized and, since generalization is the root 
of coherent inference, only some rather specialized concepts are 
useful (in the sense that only these yield mechanized inference in 
place of a haphazard and utterly impracticable search for the 
solution to a problem). A simple but adequate device is con- 
sidered by Ullman!*3, 

As a conjecture, any knowable symbolic environment is 
regular enough to admit inference by a machine that embodies 
the principles that determine its regularity (for, as Greene¹⁹ so
often points out, a learning machine cannot be a passive entity 
that is shaped by its experience. It must, if it learns by inference, 
have embedded preconceptions about its environment). It may 
turn out that the embodiment of these principles of regularity, 
like the evolutionary rules given in 2.7 and 2.8, depends upon
a basic property of physical networks. At least one can cite man 
as the natural existence proof of such a machine. 


2. Machine Organization 


2.1. The Organization and the Environment of Learning Machines 


Apart from citing a few illustrative cases the first part of the 
discussion is concerned with information and computation, with 
the organizations that compute, their input and their output. 
Later one will examine the physical processes that mediate this 
organization. Since the minimal criterion for learning is stipu- 
lated in 1.3 as a change in the computation that characterizes 
a behaviour pattern, any learning machine ‘learns to compute’. 
Since this change must be goal directed (again from the criterion 
given in 1.3) any learning machine also ‘learns to control’ 
(even if only in the rather trivial sense that it controls the rein- 
forcement delivered by the experimenter who acts as its environ- 
ment). 

Different kinds of learning machine will be distinguished ac- 
cording to the control procedure they are able to learn, the 
form of environment in which they learn, and the goals they 
achieve. 


2.2. Well Defined and Nearly Stationary Kinds of Environment 


Suppose that the environment has the characteristics of a
closed (or approximately closed) physical system such as an
industrial process. It is described in terms of variables like
temperature and pressure that are well defined, often continuous and
always related by process equations. The goal may be to maximize
the output of a product. If the system has a stationary
behaviour (when it really is closed and its equations really are
complete) the learning machine is employed for convenience
alone. Its designer could have learned the equations himself and
built them into the control mechanism. On the other hand, if the
system has a non-stationary behaviour, the controller must adapt
its form of computation to maintain the required goal condition.
It must be an adaptive controller and may, perhaps, be a learning
machine. Andrew²⁰ has discussed the subject in detail.
Although the structure of an adaptive controller is well known,
it is shown in Figure 1 for completeness and in order to introduce
a symbolism.

Figure 1


The overall controller B inspects the input and output state
(the outcome or environment state) of a computer C that controls
the environment. B determines proximity to a given goal
as a value of $\theta(x, y)$, a descriptor, and changes the function
computed by C (as indicated by the parametric coupling) so that
the goal state is approximated. The learning machine B, C, is a
deterministic or stochastic hill climber in which C obeys a
relation of the form

$$y = f_\varphi(x) \tag{1}$$

where $x \in X$ is an input state, $y \in Y$ is an output state, $f_\varphi$ is a
function $f \in F$ with an index $\varphi$ adjusted by the output of B, and
where y designates a selection $y_{t+1}$ if x is a selection $x_t$, $t = 1, 2, \ldots$
The box B obeys a relation

$$\varphi = G(x, y) \tag{2}$$

where G is a strategy chosen to maximize the value of $\theta(x, y)$.
Minsky and Selfridge²¹ have usefully distinguished several
forms of hill climbing.

(a) The goal is a unique maximum of a numerical valued 
descriptor on the machine’s parameter space wherein points re- 
present the computed function. The learning machine is a simple 
optimizer. 

(b) The goal is a maximum of several local maxima, in which
case the machine must include a ‘random’ perturbation in order
to avoid suboptimal maxima.

(c) The goal is a ‘spike in a plain’ in which case the strategy
of the machine must be ‘hill finding’ rather than ‘hill climbing’.
This situation is trivially dealt with by the impracticable
expedient of ‘random’ search. In fact a very different machine is
needed as argued in 2.7 and 2.8, adding a further case to the
Minsky and Selfridge categories.

(d) The goal is an optimum sequential decision policy which
can be shown to exist in several important cases (in particular
when different values of the parameter $\varphi$ determine Markovian
subsystems which have the property of generating well-defined
values of the stochastic variable to be optimized).
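
Cases (a) and (b) can be made concrete with a short sketch. It assumes, purely for illustration, that the descriptor has been reduced to a function of a single real-valued parameter, so that B has only to adjust that parameter. The function name `hill_climb`, the step size, the search span and the restart scheme are hypothetical and are not taken from Minsky and Selfridge. Case (c), the 'spike in a plain', is deliberately not handled.

```python
import random


def hill_climb(theta, phi0, step=0.1, iters=200, restarts=1,
               span=(-5.0, 5.0), seed=0):
    """Maximize theta(phi) by local search.

    restarts=1 gives the simple optimizer of case (a); restarts > 1 adds
    the 'random' perturbation of case (b) by trying fresh starting points.
    """
    rng = random.Random(seed)
    best_phi, best_val = phi0, theta(phi0)
    for r in range(restarts):
        # the first run starts at phi0; later runs start from a 'random' point
        phi = phi0 if r == 0 else rng.uniform(*span)
        val = theta(phi)
        for _ in range(iters):
            # inspect both neighbours and move only if the better one improves
            cand = max((phi - step, phi + step), key=theta)
            if theta(cand) <= val:
                break                      # local maximum, to within one step
            phi, val = cand, theta(cand)
        if val > best_val:
            best_phi, best_val = phi, val
    return best_phi, best_val


def two_hills(p):
    """Descriptor with a local maximum near -2 and a global maximum near 3."""
    return max(1.0 - (p + 2.0) ** 2, 3.0 - (p - 3.0) ** 2)


print(hill_climb(two_hills, -3.0, restarts=1))    # settles on the lower hill
print(hill_climb(two_hills, -3.0, restarts=20))   # perturbation finds the higher one
```

Started at -3, the unperturbed climber stops on the lower of the two hills; with perturbed restarts the higher hill is found, which is the behaviour that case (b) requires.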

With the possible exception of (a) above (which constitutes
a limiting case) the function computed by the adaptive machine
is probabilistic, in which case the relation of eqn (1) is more
conveniently written as:

$$y = x(P_\varphi) \tag{3}$$

where $P_\varphi = \lVert p^{\varphi}_{ij} \rVert = p_\varphi(y_j \mid x_i)$ and $i = 1, 2, \ldots, n$ and
$j = 1, 2, \ldots, m$ are indices of $x_i \in X$ and of $y_j \in Y$.

Thus C selects amongst a set of $n \times m$ conditional probability
matrices $P_\varphi$. If the domain and the range of all $P_\varphi$ is the same
(the state sets X and Y) the overall controller may institute a
reinforcing strategy. In this case, if $\theta_0$ is a given value of $\theta$, the
entire machine computes conditional probabilities

$$p_\varphi(y_j \mid x_i) = p[y_j, \theta(x, y) \geq \theta_0 \mid x_i] = p(y_j, \xi \mid x_i) \tag{4}$$

where the event $\xi$ occurs if and only if $\theta \geq \theta_0$.

In this case, G of (2) becomes a function that maximizes $\theta_0$
so that the system as a whole selects a sequence $\bar{y} \subset Y$ that, given
$\bar{x} \subset X$, maximizes the expectation that the value $\theta(x, y) \geq \theta_0$
for $\theta_0$ set as high as possible.
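
A small numerical sketch may help to fix the idea of selecting amongst conditional probability matrices. It assumes, for illustration only, that the descriptor is a fixed table over pairs of input and output states and that the probabilities of the input states are known. The function names and the two example matrices are hypothetical, and the exhaustive comparison used here stands in for whatever reinforcing strategy B actually employs.

```python
def success_probability(P, theta, theta0):
    """For each input state, sum the output probabilities for which
    the descriptor reaches theta0.

    P[i][j]     : conditional probability of output j given input i
    theta[i][j] : descriptor value for the pair (input i, output j)
    Returns, for each input i, the probability that the goal event occurs.
    """
    return [
        sum(p for p, t in zip(row_p, row_t) if t >= theta0)
        for row_p, row_t in zip(P, theta)
    ]


def select_index(matrices, theta, theta0, input_probs):
    """Return the index of the matrix that makes the goal event most probable."""
    def expected(P):
        per_input = success_probability(P, theta, theta0)
        return sum(q * s for q, s in zip(input_probs, per_input))
    return max(range(len(matrices)), key=lambda k: expected(matrices[k]))


# Two inputs, two outputs; the goal is met when output j matches input i.
theta = [[1.0, 0.0],
         [0.0, 1.0]]
P_a = [[0.5, 0.5], [0.5, 0.5]]    # indifferent subsystem
P_b = [[0.9, 0.1], [0.2, 0.8]]    # subsystem that usually matches
print(success_probability(P_b, theta, 0.5))
print(select_index([P_a, P_b], theta, 0.5, [0.5, 0.5]))
```

Raising the threshold passed as `theta0` until no matrix achieves an acceptable probability corresponds, in this simplified setting, to setting the critical value as high as possible.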

If $\varphi$ is discrete valued (as one assumes) then a useful
isomorphic representation of a learning machine is derivable by
replacing the parameter variation by a selective process amongst
computing machines or subcontrollers labelled in Figure 2 as
$C_1, C_2, \ldots, C_m$, $\varphi = 1, 2, \ldots, m$, able to compute one of the m
possibly computable functions (the sign @ implies that c is
selected, not that c receives an input channel, denoted as > [c).
Since the inputs and the outputs of the c are parallel connected,
it is unrestrictive to impose the rule that one and only one of
them is selected at once. It is, of course, possible to select
adaptive subcontrollers, which is tantamount to selecting a selector
and that set from which it selects. This image emphasizes the
idea of ‘selective amplification’ first considered by Ashby²² and
the related concept of ‘an hierarchy of control’.

Figure 2


The selected subcontroller, $C_\varphi$, makes selections, when it
computes, from $y \in Y$ on the evidence of a sequence $\bar{x} \subset X$. On
the other hand, B selects amongst the $c \in C$ by assigning a value
to the index variable $\varphi$ on the basis of evidence that has been
abstracted from the behaviour of the environment when controlled
by its own selection of C. Hence the selective process is
partitioned and the selections made by B are ‘amplified’ in the
sense that a research director’s selective activity is ‘amplified’
when, instead of selecting from solutions to a set of problems,
he selects from a set of qualified scientists who solve
most of these problems on his behalf. Regarded as a computer, B
computes a function G with domain the values of $\theta(x, y)$ [an
index set of the product set (X, Y)] and with range the values of
$\varphi$ (an index set of the set C). Hence G is a function of higher
logical order than F, and B is higher in an hierarchy of control
or organization than any subcontroller $c \in C$. Indeed, in selecting
a subcontroller, B selects a sign denoting the invariant that this
subcontroller maintains. The decisions made by a learning
machine that acts as a control mechanism are made about some
abstract representation of the environment.
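
The difference in logical order between B and its subcontrollers can be sketched in a few lines. The sketch assumes that each subcontroller is an ordinary function from input to output and that B may score each one against a sample of inputs. The names `select_subcontroller`, `subcontrollers` and `descriptor`, and the toy goal (output equal to twice the input), are hypothetical.

```python
def select_subcontroller(subcontrollers, descriptor, evidence):
    """Play the part of B: return the index of the subcontroller whose
    outputs score best, on average, against the descriptor.

    B never chooses an output itself; it chooses which function will.
    """
    def score(c):
        return sum(descriptor(x, c(x)) for x in evidence) / len(evidence)

    scores = [score(c) for c in subcontrollers]
    return max(range(len(subcontrollers)), key=scores.__getitem__)


# Three candidate subcontrollers, each computing a different function.
subcontrollers = [lambda x: x, lambda x: 2 * x, lambda x: -x]


def descriptor(x, y):
    """Proximity to the toy goal y = 2x (zero is best)."""
    return -(y - 2 * x) ** 2


print(select_subcontroller(subcontrollers, descriptor, [1, 2, 3]))
```

The selector takes functions as its arguments and returns an index into the set of functions, which is the sense in which its selections are 'amplified': one choice by B settles every subsequent choice of output made on its behalf.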


In order to consolidate our ideas, the pioneer work of 
Uttley and Ashby is briefly considered. Uttley has stressed 
the process of abstraction and of inference about the existing 
state of affairs whereas Ashby is mostly concerned with issues 
of control and stability. These different approaches, however, 
turn out to be complementary and lead to very similar learning 
machines. 


2.3. Learning Machines Derived from Uttley’s Conditional
Probability Machine


Consider a set of binary variables, a*, b*, c*, ... which as- 
sume non-zero values if and only if the environment possesses 
corresponding properties, a, b, c, .... Although there is no limit 
upon the number, M, of binary variables that may appear, 
M = 3 is used for illustration. 

The environment can be classified by an abstractive network
of AND units, as shown in Figure 3, where a particular output
of the network, such as a·b, is denoted a ‘pattern’, and a
‘pattern’ (which is one element in a system of classification) is said
to exist in the environment if (in the case cited) a*·b* = 1.

The occurrence of a pattern (except the exhaustive pattern 
a: b+ c) does not exclude the presence of one or more other 
patterns (thus a*- b* = 1 does not exclude the possibility of 
a*- b*+ c* = 1 also). However, the network can be made to 
discriminate up to 2™ states of the binary variables if NOT 


The Environment 
‘ee = AND unit 


Figure 3 
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The Environment 





C = AND unit 
A\ = Unit delay 


Figure 4 


units are also incorporated. The network can also respond to 
sequences of events if suitable DELAY units are added. In the 
simplest case, if one and only one of a*, b*, and c* is non-zero 
at once, the network of Figure 4 can recognize sequential pat- 
terns like ‘a after 5’ and ‘b after a’. 
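The classification performed by an abstractive network of AND units can be illustrated with a short Python sketch. It is not taken from the survey: the function name `and_patterns` and the dictionary encoding of the starred variables are assumptions made purely for the example, and M = 3 is used as in the text.

```python
from itertools import combinations

def and_patterns(values):
    """Return every AND pattern that exists for the given binary variables.

    `values` maps an attribute name to its starred binary variable,
    e.g. {"a": 1, "b": 1, "c": 0} stands for a* = 1, b* = 1, c* = 0.
    """
    names = sorted(values)
    present = []
    for size in range(1, len(names) + 1):
        for combo in combinations(names, size):
            # An AND unit is energized only if all of its inputs are non-zero.
            if all(values[name] for name in combo):
                present.append("·".join(combo))
    return present

patterns = and_patterns({"a": 1, "b": 1, "c": 0})
all_patterns = and_patterns({"a": 1, "b": 1, "c": 1})
```

With a* = b* = 1 and c* = 0 the patterns a, b and a·b exist together, which illustrates the remark that one pattern does not exclude another. With all three variables set, the network has 2^3 − 1 = 7 energized outputs.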

If the environment is deterministic, the output of such a
network asserts the patterns which exist. On the other hand, if the
variables a*, b* and c* convey an imperfect image of the
attributes a, b and c, its output provides evidence on the basis of
which inferences can be made about the patterns that exist. Thus
a* = 1 may provide evidence that a·b is present even though,
on this particular occasion, a*·b* = 0 (the underlying
assumption is that the environment has stationary statistical
characteristics which can be estimated by a machine that has
experienced the environment over an appreciable interval).

A conditional probability machine (several have been built)
is able to make inferences on the basis of counts and ratios of
counts of the number of occasions upon which each output of
an abstractive network has previously been energized. The
number of previous energizations of output r is denoted as
z(r) = z(r* = 1) when, citing the output a·b as a typical case,
the estimate

\[ \frac{z(a \cdot b)}{z(a)} \ \text{converges to}\ p(a \cdot b \mid a) = p(a \cdot b \mid a^{*} = 1) \tag{5} \]

in a stationary environment. Choosing a critical level of L the
machine that embodies these requirements may infer a·b if,
given a* = 1, p(a·b|a) is greater than L thus

\[ \text{Given } a^{*} = 1, \text{ infer } a \cdot b \text{ if } p(a \cdot b \mid a) > L \tag{6} \]
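The counting procedure behind eqns (5) and (6) can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and method names are hypothetical, linear counters are used instead of Uttley's logarithmic ones, and every conjunction of the active attributes is counted on each occasion.

```python
from collections import Counter
from itertools import combinations

class ConditionalProbabilityMachine:
    def __init__(self, critical_level):
        self.critical_level = critical_level  # the level L of eqn (6)
        self.counts = Counter()               # z(r) for every pattern r

    def observe(self, active):
        """Count one energization of every pattern present in `active`."""
        names = sorted(active)
        for size in range(1, len(names) + 1):
            for combo in combinations(names, size):
                self.counts[frozenset(combo)] += 1

    def estimate(self, pattern, given):
        """Ratio of counts z(pattern) / z(given), the estimate of eqn (5)."""
        z_given = self.counts[frozenset(given)]
        if z_given == 0:
            return 0.0
        return self.counts[frozenset(pattern)] / z_given

    def infer(self, pattern, given):
        """Decision of eqn (6): infer the pattern if the estimate exceeds L."""
        return self.estimate(pattern, given) > self.critical_level

machine = ConditionalProbabilityMachine(critical_level=0.6)
for _ in range(3):
    machine.observe({"a", "b"})   # a and b occur together three times
machine.observe({"a"})            # a occurs once on its own
estimate_ab = machine.estimate({"a", "b"}, {"a"})
```

Here z(a·b) = 3 and z(a) = 4, so the estimate of p(a·b|a) is 0.75 and, with L = 0.6, the machine infers a·b whenever a* = 1.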


Uttley!*. 415 has built conditional probability machines 
using logarithmic scale counters and forming ratios by the
subtraction of logarithms. He points out that a threshold unit r
with impulse inputs of amplitude η_i·β_i, weights α_ir, and output
η_r·β_r, such that η_r is a unit impulse that occurs if and only if

\[ \sum_i \alpha_{ir}\, \eta_i\, \beta_i > \mu_r \tag{7} \]

where μ_r is the threshold, can form the basic unit of a conditional
probability machine providing each unit includes counters such
that

\[ \beta_r \text{ is inversely proportional to } z(\eta_r) = z(r), \qquad
   \alpha_{ir} \text{ is proportional to } z(\eta_r \cdot \eta_i) \tag{8} \]


These threshold units are arranged in the hierarchical
structure of an abstractive network, the output of any unit providing
an input term of any unit in which it is conjoined. Thus for units
a, b, and a·b there is the structure of Figure 5.

The value of β_ab, given a* = 1, is the estimate of p(a·b|a)
and the machine may infer a·b, given a, if β_ab > L. The crucial
point of Uttley’s construction is that the conditional probability
computation depends upon a pair of variables (interpreted as α
and β). In this connection, Maron”; 4 has recently demonstrated
that a variable threshold element which forms a product from
a set of m weighted inputs can emit an output impulse (designating
the assumed truth of an hypothesis) in consonance with an
optimum Bayes decision strategy. In the first place it can be
shown that an optimum decision strategy for the jth threshold
element given the input evidence, η_1, ..., η_m, j > m, is satisfied by

\[ \eta_j = 1 \text{ if and only if } p(\eta_j \mid \eta_1, \ldots, \eta_m) = p(\eta_j \mid \eta_i) > \tfrac{1}{2} \tag{9} \]





Figure 5 


Both the α_ij and the threshold μ_j are interpreted as variables
dependent upon event counts that converge to p values (to
maintain correspondence with the rest of the discussion, assume that
β_i = 1 so that η_i = β_i·η_i). If these variables are identified as

\[ \alpha_{ij} = \frac{p(\bar{\eta}_j)\, p(\eta_j \mid \eta_i)}{p(\eta_j)\, p(\bar{\eta}_j \mid \eta_i)}
   \quad \text{and} \quad
   \mu_j = \frac{p(\bar{\eta}_j)}{p(\eta_j)} \]

where η̄_j implies the absence of an impulse η_j, so that p(η̄_j) is
the probability that η_j = 0, then it is readily shown that (9) will
be satisfied if an element has the transfer function

\[ \eta_j = 1 \text{ if and only if } \prod_{i=1}^{m} \alpha_{ij}^{\eta_i} > \mu_j \tag{10} \]


Elements of this kind have been simulated by Maron and 
are physiologically plausible. 
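A product-forming threshold element of the kind attributed to Maron can be sketched in Python as below. The sketch is illustrative rather than a reproduction of Maron's simulation. It assumes that the inputs are conditionally independent given the hypothesis, that only the inputs which are present count as evidence, and that each weight is written in the likelihood-ratio form p(η_i|η_j)/p(η_i|η̄_j). All function names and the numerical values are hypothetical.

```python
from math import prod

def alpha(p_input_given_h, p_input_given_not_h):
    """Weight of one input: p(eta_i | eta_j) / p(eta_i | not eta_j)."""
    return p_input_given_h / p_input_given_not_h

def threshold(p_h):
    """Threshold mu_j = p(not eta_j) / p(eta_j)."""
    return (1.0 - p_h) / p_h

def fires(inputs, alphas, mu):
    """Transfer function: product of alpha_i ** eta_i compared with mu."""
    return prod(a ** eta for a, eta in zip(alphas, inputs)) > mu

def posterior(inputs, likelihoods, p_h):
    """Bayes posterior of the hypothesis given the inputs that are present."""
    num = p_h
    den = 1.0 - p_h
    for eta, (given_h, given_not_h) in zip(inputs, likelihoods):
        if eta:
            num *= given_h
            den *= given_not_h
    return num / (num + den)

likelihoods = [(0.8, 0.2), (0.6, 0.3)]   # (p(eta_i | h), p(eta_i | not h))
p_h = 0.25                               # prior probability of the hypothesis
alphas = [alpha(g, n) for g, n in likelihoods]
mu = threshold(p_h)
cases = [(0, 0), (0, 1), (1, 0), (1, 1)]
decisions = [fires(case, alphas, mu) for case in cases]
```

For these values the element emits an impulse for exactly those input combinations whose posterior probability exceeds one half, which is the decision rule of eqn (9) under the stated assumptions.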

Since at least a couple of variables, α_ij and μ_j, are involved
and since a logarithmic measure of the event counts allows the
element to decide on the basis of a difference between the
threshold and a sum, there is a close relationship between this model
and Uttley’s variable α, β system.

Regarding Uttley’s work, an abstractive network of threshold
units can be regarded as an adaptive filter having an output of
patterns, indicated or weighted with a certain amplitude of
signal, and an input of binary variables indicating the existence
of attributes. After some experience of its environment, this
filter, given the index of an attribute, provides a high amplitude
output from those channels associated with patterns in which
the given attribute has often been conjoined. (Thus, given the
occurrence of a, the output of a·b is high if a·b has often
occurred.)

Defining rarity as the negative logarithm of a probability,
Uttley considers a conditional rarity machine built in the same
manner but from units in which eqn (8) is replaced by

\[ \beta_r \text{ proportional to } z(\eta_r), \qquad
   \alpha_{ir} \text{ inversely proportional to } z(\eta_r \cdot \eta_i) \tag{11} \]

which adapts to become a filter that, given the occurrence of an
attribute, provides a high amplitude output from channels
associated with patterns in which the given attribute is rarely
conjoined. (Thus, if a·b often occurs and a·c rarely occurs then,
given a, the output of a·c is of high amplitude and the output
of a·b is of low amplitude.)

It has been assumed that an abstractive network is specified
(so that the relevant patterns in the environment have already
been determined). If this is not the case (and commonly it is not)
it would be possible to build a complete network, with 2^M
outputs. Providing that all the relevant attributes have been indexed
by binary variables, the relevant patterns will exist in its output.
However, the number of elements required to build this network
becomes gigantic for large values of M and the construction
would be impractical. Further, in a control mechanism, the
relevant patterns which constitute the evidence for a decision are
normally restricted. Hence it appears necessary to start off with
uncommitted elements which the machine connects together
depending upon its experience of the environment by some process
of maturation which might be called ‘learning an abstraction’.
The first possibility is that connections develop between
coincidentally stimulated units a and b and a unit that comes to index
their conjunction a·b. Uttley discards this hypothesis because
it entails the existence of a unit labelled a·b to recognize the
a and b conjunction in the first place. The alternative hypothesis
is an over connected network of units which differentiates and
becomes more selective as a result of its experience. The
organization of a conditional rarity machine can be shown to develop
in this manner.


Figure 6 [label: Independent chance process]


Suppose there is a conditional probability (or a conditional
rarity) machine delivering indices of the conditional probabilities
of several states x_i ∈ X_0, denoted as p(x_i|x_j), given any
x_j ∈ X − X_0, where X_0 ⊂ X is a particular subset of states
entailed by a control procedure with a decision rule Λ, from X_0
into Y. This machine, combined with an apparatus that embodies
Λ, is a control mechanism.

The environment can only be specified statistically. Hence
the decision rule Λ is embodied in a probabilistic device that
selects y ∈ Y as a function of an independent chance-like process
that is biased by Λ‖p(x_i|x_j)‖ as in Figure 6. Thus the system’s
behaviour is expressed by

\[ y = x(P) \quad \text{for } x \in X - X_0, \text{ where } P = \| p(y_i \mid x_j) \| = \| \Lambda[\, p(x_i \mid x_j) \,] \| \tag{12} \]

which is consonant with eqn (3).
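The probabilistic selection of a response can be illustrated by a minimal Python sketch. It assumes that the matrix P is already available as rows of probabilities p(y|x); the nested-dictionary layout, the function name `select_response` and the state labels are hypothetical, and the pseudo-random generator merely stands in for the independent chance process.

```python
import random

def select_response(state, response_matrix, rng):
    """Pick a response y for the observed state x using the row p(y | x) of P."""
    row = response_matrix[state]
    responses = list(row)
    weights = [row[y] for y in responses]
    # The chance process is biased by the conditional probabilities in the row.
    return rng.choices(responses, weights=weights, k=1)[0]

P = {
    "x1": {"y1": 0.9, "y2": 0.1},
    "x2": {"y1": 0.0, "y2": 1.0},
}
rng = random.Random(0)
picks_x1 = [select_response("x1", P, rng) for _ in range(1000)]
picks_x2 = [select_response("x2", P, rng) for _ in range(100)]
```

A row with a single non-zero entry gives a deterministic response, whereas a mixed row gives responses whose relative frequencies approach the entries of P over many trials.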

The formulation is essentially unchanged if the output selects 
amongst sequences of motor activities, rather than distinct re- 
sponse states. 

To complete the picture, an adaptive reinforcing procedure
must be introduced. Several mechanisms have been proposed by
Uttley and Andrew, of which the most convenient is to adjoin
a further state, ξ, which exists if and only if θ > θ_0, as in (4), to
each conjunction in the abstractive network. Thus the machine
computes conditional terms, which are ‘conditional upon
reinforcement’, yielding a set of values

\[ p(y \cdot \xi \mid x) \]

as in (4). Note that the abstractive system shown in
Figure 7(a) is characterized, at a particular instant, by a matrix

\[ P_{\varphi} = \| p(y \cdot \xi \mid x) \| \]

where φ is the adaptive parameter.


Figure 7. In practical systems direct θ reinforcement is used, or,
alternatively, θ_0 is changed each finite interval by an adjustment
+Δθ if the mean value of dθ/dt over an interval t exceeds 0, or
−Δθ if not. (a) and (b) are functionally identical.
[labels: Limiter θ > θ_0; Adaptive; The Environment]


Variation in the parameter φ will adequately describe the
maturation as well as the adaptation of a probabilistic machine
providing that it is always presented with the same statistically
stationary environment. When the environment is non-stationary,
however, the p values are in continual flux. If the environment
is characterized by several distributions Π_1, Π_2, ..., Π_L, there
may be up to L different successful adaptations and the machine
will have to re-adapt whenever Π_r, r = 1, 2, ..., L, is changed.
Thus, having adapted to Π_r by forming P_φr, the machine may
be presented with Π_r+1, when it must start to adapt once again.
As a result of achieving the successful adaptation P_φ,r+1 it loses
all trace of P_φr and is at no particular advantage if Π_r occurs
subsequently. What is needed (and what may readily be provided)
is an attention mechanism Q as in Figure 8 selecting amongst
differently matured networks according to evidence from the
immediate behaviour of the system. Each box M in Figure 8
represents a complete machine of the form II in Figure 7
selected by Q according to the convention of Figure 2.
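One possible reading of the attention mechanism Q is sketched below in Python. The survey does not specify how Q weighs its evidence, so the maximum-likelihood selection used here is an assumption, as are the names `attend` and `log_likelihood` and the labels given to the matured matrices.

```python
import math

def log_likelihood(matrix, observations):
    """Sum of log p(y | x) over recent (x, y) pairs under one matured matrix."""
    total = 0.0
    for x, y in observations:
        p = matrix[x].get(y, 0.0)
        if p <= 0.0:
            return -math.inf   # the matrix rules this observation out
        total += math.log(p)
    return total

def attend(matured, observations):
    """Q: return the label of the matured network that best fits the evidence."""
    return max(matured, key=lambda label: log_likelihood(matured[label], observations))

matured = {
    "Pi_1": {"x": {"y1": 0.9, "y2": 0.1}},   # network matured under Pi_1
    "Pi_2": {"x": {"y1": 0.2, "y2": 0.8}},   # network matured under Pi_2
}
recent = [("x", "y2"), ("x", "y2"), ("x", "y1")]
selected = attend(matured, recent)
```

Because each matured matrix is retained, a return of the environment to an earlier distribution needs only a new selection by Q and not a fresh adaptation.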

Other probabilistic learning machines (Steinbuch’s”® learn- 
ing matrices, a system due to Ratz and Thomas!®, and the 
‘Eucrates’ systems devised by Bailey, McKinnon Wood and 
Pask®) closely resemble the original formulation of Uttley’s. 
They differ chiefly in emphasis (in the ‘Eucrates’ system, for
example, the idea of an attention mechanism is emphasized) or
in identification (they are not, as the Uttley model is, wedded
to a neurological hypothesis).


Figure 8 [label: The Environment, characterized by one of the
distributions Π_r]


2.4. Other Learning Machines Built from Networks of Elements 


There are many adaptive networks in which the coupling 
weights α_ij between the elementary units are varied as a function
of the activity of the system and an externally manipulated rein- 
forcement variable. Taylor?’ 1% for example, has an artefact 
made from linear elements (the impulse rate at the output is 
proportional to the weighted sum of the impulses arriving at 
several inputs). Widrow”’, Willis?® and others have built
networks of threshold devices (which satisfy eqn (7) with β a
constant) and Babcock* constructed a special purpose computer in
which the elementary units can be variously programmed. All 
of these systems can be induced, by suitable training, to respond 
in a given fashion to patterns of input stimuli. Probably the 
largest body of data in this field is due to Rosenblatt?! who has 
fabricated many different adaptive pattern recognition machines 
called ‘Perceptrons’. A typical Perceptron is shown in Figure 9. 
The ‘response’ units are threshold units with constant weights.
The ‘association’ units are threshold devices with variable weights
α_ij, the change Δα_ij depending upon the correlated activity
appearing at the input and output of the jth element and upon the
approval of the external mentor. Rosenblatt’s original
description allowed for ‘random’ connectivity. More specialized versions
of this system have properties that vary from a passive logical
filter, able to abstract a given attribute of its input, to
self-oscillatory systems within a back coupled mesh. Between these
limits are networks that act like conditional probability
mechanisms. The idea of ‘random’ connectivity, which appears rather
often in the literature, is more fruitfully reinterpreted as a
modicum of overconnection, in the sense cited in 2.3, that is
removed as a result of adaptation. In the first place the network
is neither ‘random’ for the reasons cited in 1.8, nor is it ‘random’
in the sense of being unrestrictedly variable. As Papert*®? points
out, any m input network which was unrestrictedly variable (in
the sense of being able to compute all 2^(2^m) functions of m binary
input variables) would be unrealizable for sensibly large values
of m since no training strategy could reach the desired adaptation
within an acceptable time. Fortunately, the issue does not arise
in practice for many special limitations are built into the
specification; for example, threshold elements can only compute linear
discriminating functions (if the η_i are coordinates of a state
space the values of the α_ij determine a linearly discriminating
plane Σ α_ij η_i = μ_j, and η_j = 1 if the input state is on one side
of this plane and η_j = 0 if it is on the other). This case is
examined by Papert®*, Singleton** and Scott Cameron™ who
show that the percentage of Boolean functions that are linearly
discriminating decreases rapidly with increasing value of m.
Ivanhenko*® considers similar problems in the case of a perceptron.


Figure 9 [labels: Retina; Association units; Response units;
Reinforcement]
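The decline in the proportion of linearly discriminating functions can be checked for very small m by brute force. The Python sketch below is an illustration only: it assumes that integer weights of magnitude at most 2 and half-integer thresholds are enough to realize every threshold function for m ≤ 3, which holds for these small cases but is not a general result, and the function names are hypothetical.

```python
from itertools import product

def threshold_functions(m, max_weight=2):
    """Truth tables over m binary inputs realizable by one threshold element."""
    states = list(product((0, 1), repeat=m))
    weight_values = range(-max_weight, max_weight + 1)
    bound = m * max_weight
    found = set()
    for weights in product(weight_values, repeat=m):
        sums = [sum(w * x for w, x in zip(weights, state)) for state in states]
        # Half-integer thresholds cover every distinct cut of the integer sums.
        for k in range(-bound - 1, bound + 1):
            mu = k + 0.5
            found.add(tuple(int(s > mu) for s in sums))
    return found

def separable_fraction(m):
    """Fraction of all 2^(2^m) Boolean functions found by the search."""
    return len(threshold_functions(m)) / 2 ** (2 ** m)

fractions = {m: separable_fraction(m) for m in (1, 2, 3)}
```

The search finds all 4 functions of one variable, 14 of the 16 functions of two variables (the exceptions being exclusive-or and its complement) and 104 of the 256 functions of three variables, so the fraction falls from 1 to 0.875 to about 0.41.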


It may turn out that the class of linear discriminating
functions, although tractable in the sense that it is a small enough
domain for an adaptation strategy to work in a reasonable time,
is a rather inept choice. Many other classes of Boolean function
would be acceptable. Willis®*, for example, proposes networks
restricted to the computation of disjunctively decomposable
functions of their input and it can be shown that such a network
is separable into components, as indicated in Figure 10. It is
important to avoid confusion between this separation of
components and the hierarchical organizations considered in 2.5.


Figure 10 [label: Input]


2.5. Ashby’s Homeostat and More Elaborate Hierarchically
Organized Controllers


The Homeostat is shown in Figure 11 in the form described
by Ashby®? (a more versatile apparatus has recently been de- 
veloped by Haire, Harouless, Miller and Williams**) and a 
homeostat with memory has been constructed by Chichinadze™’. 


Figure 11 [label: Overall controller]


The original Homeostat consisted of four null output posi- 
tional servomechanisms. The positional output of any one can 
be applied (depending on a plan of interconnection) to the input 
of any other unit. Each positional output potentiometer is pro- 
vided with limit indicators and, in the original demonstration, 
was identified with one essential variable of an organism. Ashby 
used the apparatus to demonstrate ‘ultrastability’ which is a 
generalized version of adaptive control. Suppose that the
homeostat has a plan P_1 of interconnection and is given an input (from
some external environment) that perturbs several of the units.
Now, given P_1, the system may or may not be stable. If not, at
least one of the essential variable potentiometers indicates a
limit value (which is taken to contravene the stability condition
for the system). In this case, an overall controller (equivalent
to Q) is required to select another plan of connection between
the units, say P_2. If the system is stable, given P_2 and a typical
environmental input to perturb it, no further change occurs.
If not, the overall controller makes a different selection.
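The search for a stable plan of interconnection can be caricatured in a few lines of Python. This is a loose sketch and not a model of Ashby's apparatus: the four units are replaced by a discrete-time linear system, a plan is a 4 × 4 matrix of couplings drawn uniformly from [−1, 1], and a plan is rejected as soon as any variable leaves the interval ±1. All of these choices, and the function names, are assumptions made for the example.

```python
import random

def is_stable(plan, start, limit=1.0, steps=200):
    """Iterate x <- plan * x and report whether every essential variable
    stays within +/- limit for the whole run."""
    x = list(start)
    for _ in range(steps):
        x = [sum(w * v for w, v in zip(row, x)) for row in plan]
        if any(abs(v) > limit for v in x):
            return False   # a limit indicator has been reached
    return True

def random_plan(rng, units=4):
    """An arbitrary plan of interconnection between the units."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(units)] for _ in range(units)]

def ultrastable_search(rng, start, max_plans=10_000):
    """Overall controller: keep selecting plans at random until one is stable."""
    for tried in range(1, max_plans + 1):
        plan = random_plan(rng)
        if is_stable(plan, start):
            return plan, tried
    raise RuntimeError("no stable plan found")

start = [0.2, -0.1, 0.3, -0.2]   # perturbation applied by the environment
plan, tried = ultrastable_search(random.Random(1), start)
```

The overall controller uses no information about the state of the system when it selects the next plan; it simply rejects plans until the limit indicators stay quiet.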

In the homeostat, the overall controller strategy is the most
generalized (and also the least provident) conceivable. The
machine aims merely for stability, P_1 and P_2 being selected in an
arbitrary fashion and independently of the state of the system.
An adaptive controller B, C, is a specialized version of this
paradigm. However, note that the boxes c ∈ C do not correspond
to the physically discriminable units in a homeostat. They
correspond, instead, to subsystems created by an interactive
coupling specified by P_1 or P_2, between these units. The computing
machines are well defined organizations, but they are not
necessarily well localized physical entities. This comment has marginal
relevance in connection with simple homeostasis but it becomes
important when considering extensive hierarchies of control as
in Figure 12.


Figure 12


MacKay* #9 was the first to consider large, hierarchical, as- 
semblies of computing systems. Using the argument of 2.2. 
(which is suggested by MacKay) a mechanism A may be
regarded as selecting amongst signs that denote the invariants
maintained by subcontrollers b ∈ B, and the b ∈ B may be viewed as
selecting amongst signs associated with the c ∈ C. Hence the
entire learning machine is capable of abstraction (in particular 
of abstracting a goal or homeostatic criterion). Andrea*! dis- 
cusses the issue in detail and Mesarovic and Banerji!’ have 
developed a theory of a self-organizing system based mainly on 
the idea of an hierarchy of control. 


If the goal is well determined and unchanging, the structure 
has little interest. There are, however, many environments to be 
controlled in which the goal is nebulous, for example, if the 
control mechanism is used in management it will probably be 
asked to maintain ‘happiness’ and ‘productivity’. Now even if 
one presupposes that a machine construction for ‘happiness’ or 
‘productivity’ maintenance can be upheld, it is still necessary to 
admit that these properties are differently interpreted upon 
different occasions, and that the level as well as the form of 
abstraction involved in their expression is variable. If the ma- 
chine were less flexible, control would be impossible, because 
the required invariance is specified with reference to a parameter 
which is given different ostensive definitions upon different 
occasions. 

A very similar dilemma is engendered when the environment 
to be controlled is made up from incomparable parts. In this 
case, the control problem is solved by a machine that contains 
an attention mechanism as given in 2.3. In order to achieve 
a given goal, this machine must ‘attend’ to different inputs 
upon different occasions and ‘contemplate’ different response 
alternatives. 

It can be said that any machine
which includes an hierarchy of control, or of goals, can, in
certain circumstances, be said to contain an hierarchy of
metalanguages and that any machine able to build such an hierarchy
of control can, in the same circumstances, be said to ‘construct’
metalanguages. Manifestly, such a machine can build
abstractions and, given the provisions of 1.5, it may be said to use
‘concepts’! 48.


2.6. A Self-organizing System, Localized, and Unlocalized
Automata


Before considering the issues of attention and goal varia- 
tion in earnest, let us examine one, slightly more tractable, sys- 
tem. This is a special case of a so called ‘self-organizing system’ 
(it is commonly agreed that the term ‘self-organizing system’ is 
unfortunate since the entity concerned is really a sequence of 
systems). Now according to von Foerster*: *: 4° a subsystem 
J_r (one member of this sequence) is a self-organizing system if
the rate of change of its redundancy, R, is positive


\[ dR(J_r)/dt > 0 \tag{13} \]
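The condition of eqn (13) can be made concrete with a small Python sketch. It assumes the usual Shannon definition of redundancy, R = 1 − H/H_max, with H_max taken as the entropy of the uniform distribution over the same states. The function names and the sample distributions are hypothetical.

```python
from math import log2

def entropy(probabilities):
    """Shannon entropy in bits; zero-probability states contribute nothing."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

def redundancy(probabilities):
    """R = 1 - H / H_max, with H_max the entropy of the uniform distribution."""
    h_max = log2(len(probabilities))
    return 1.0 - entropy(probabilities) / h_max

def is_self_organizing(history):
    """True if redundancy rises at every successive observation (dR/dt > 0)."""
    values = [redundancy(p) for p in history]
    return all(later > earlier for earlier, later in zip(values, values[1:]))

history = [
    [0.25, 0.25, 0.25, 0.25],   # no regularity: R = 0
    [0.4, 0.3, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
    [1.0, 0.0, 0.0, 0.0],       # completely regular behaviour: R = 1
]
organizing = is_self_organizing(history)
```

Once the behaviour has become completely regular the redundancy can rise no further, so a single subsystem cannot satisfy the condition indefinitely.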


The special case to be considered is a model that represents 
the data processing entity which is referred to as ‘a man’ when 
one speaks about a ‘man’ learning or performing a realistic skill. 


Figure 13 [labels: X_1, Y_1, Rule Π_1; X_2, Y_2, Rule Π_2; set of
possible inputs and outputs; X, Y = set of relevant inputs and
outputs]


There is plenty of evidence that a conscious ‘man’, regarded 
as a data processor or computer, has the characteristic that in
order to survive he must maintain a certain rate of adaptation 
(the environment must provide sufficient novelty for him to learn 
about)’. Another undeniable fact about a ‘man’ is that he is, 
over any finite interval, a finite automaton with a well defined 
input called his ‘field of attention’ (and, incidentally, a well de- 
fined output of responses that in some way act upon his ‘field 
of attention’). In so far as the training routine or the actual skill 
occupy his attention, the field of attention, with relevant input
states X_r and output states Y_r, is coupled to relevant stimuli and
responses as suggested in Figure 13. This condition would not
be satisfied if the man were bored or inattentive. In order to
satisfy the basic requirement that a certain rate of adaptation
is maintained, the environment must be capable of offering
subsets of stimuli and responses (X_r, Y_r) that have a statistical
relationship Π_r which the man can learn, where (X_r, Y_r) ⊂ X, Y
and X, Y is the product set of relevant stimulus and response
states.

Figure 14. Selection of 7, Pyo, is shown


This primitive image of man can be modelled with a set of
the conditional probability machines given in 2.3, characterized
by matrices P_φr as in eqn (3) and having input states X_r
and output states Y_r. The environment is characterized by a set
of matrices Π_r (which may also be embodied in a machine as
suggested in Figure 14). It will be convenient to assume that P_φ0
has equal p entries. Calling the rth coupled subsystem consisting
of ‘man’ and his immediate environment J_r, it is obvious that
the adaptation rate condition is equivalent to eqn (13). In this
model one can insist that J_r will maximize its chance of survival,
say by computing R(J_r) as an index of adaptation and, from eqn
(4), making

\[ \theta_r = R(J_r) = 1 - \frac{H(J_r)}{H(J_r)_{\max}} = 1 - \frac{H(X_r, Y_r)}{H(X_r, Y_r)_{\max}} \tag{14} \]

and since H(X_r, Y_r)_max is invariant for a given coupling, one
has θ_r ~ −H(X_r, Y_r). Thus, if eqn (14) is substituted in eqn
(4) and if the values of the adaptive parameter φ converge,
φ_0 → φ_1 → ... → T, one obtains

\[ d\theta_r/dt \sim dR(J_r)/dt > 0 \tag{15} \]

for φ → T. Wattanabe*’, for example, has considered the
dynamics of systems like this which must adapt and points out
that learning behaviour as well as learning models satisfy

RJ) |t# —H(J,) | t=a(t—to)e™ 


where a and b are positive constants and where t > t_0 > 0 is the
instant of observation. J_r, however, is obviously an unstable or
ephemeral self-organizing system. Since the reinforcement
variable, θ, is increased by any increase in the regularity of J_r
behaviour, the system will approach a stable (though possibly
dynamic) equilibrium characterized by

\[ P_{\varphi r} = \text{stochastic inverse}\,(\Pi_r) \tag{16} \]

However, if eqn (16) pertains, J_r is fully adapted, hence, at this
point,

\[ d\theta_r/dt \sim dR(J_r)/dt = 0 \tag{17} \]

contravening eqn (13) so that J_r is no longer a self-organizing
system (it can be demonstrated that although a ‘forgetting’
facility can delay the instant at which eqn (17) applies, ‘forgetting’
does not avoid the ultimate instability). When this occurs, a real
‘man’ will ‘change his attention’. To comprehend this fact, any
model (cf. Figures 14 and 17) must include the transformation,
say λ, embodied in a mechanism Λ with domain a set of M or
more subsystems J_r, J_r ∈ J, and with range J. Let λ receive
an instruction, whenever eqn (17) is true, to select λ(J_r) = J_r+1
which consists of a conditional probability machine characterized
by P_φ,r+1 coupled to an environment characterized by Π_r+1.
For an initial selection of subsystem J_0 and for M iterations of
this process the minimal constraints

\[ \lambda(J_r) \in J \quad \text{and} \quad \lambda(J_r) \neq J_r \]

yield a sequence

\[ J^{*} = \bigcup_{r=1}^{M} \lambda^{r}[J_0] \tag{18} \]

such that dR(J_r)/dt > 0 for all J_r ∈ J* ⊆ J. Thus J* is a
self-organizing system.

This model serves as a very primitive image of ‘man’
providing (a) that the adaptation takes place with reference to some
goal (like performing a given skill), not just arbitrarily
(increasing R is tantamount to performing any skill), and (b) that
although, in the limiting case, Λ may act when eqn (17) is true,
it is also possible to replace this instruction with an internally
or externally produced ‘attention directing’ signal.

A more basic and interesting distinction between this model
and reality is that, in fact, the set J is not denumerable. J is
defined (and consequently renders the disjunction of eqn (18)
legitimate) in terms of some property of J* which leads to
assertions like ‘these systems all represent the behaviour of a man’,
or ‘of an animal’ (which, since they refer to the systems rather
than the behaviours, are metalinguistic statements).

Typically, this situation comes about when there is a
reproductive process capable of creating and replicating machines
characterized by P_φ which survive if and only if, when coupled
to form systems like J, dθ/dt > 0. In this case, if a sequence of
M systems does survive, the transformation λ is interpretable as
a rule of evolution that, in a specified environment, gives rise to
J*, and the members J_r of J* are characterized by the property
that they are produced by this evolutionary rule.

The point leads us to a different, ‘evolutionary’, kind of
learning machine. The previous learning machines stem from the
familiar model of a localized and finite automaton (a finite
automaton with well defined input states and output states). The
model for evolutionary systems is a population of reproducing
automata.


Figure 15


Suitable constructions have been developed and discussed by 
Burke", von Neumann’, and Loefgren*®:*!, In Loefgren’s 
development of von Neumann’s work, it is pointed out that 
a dual representation of these automata is commonly and per- 
haps necessarily adopted. 

(a) As a developing connection graph (with which there is an
associated state graph). This method has been adopted by
Rashevsky®, Rosen®, and others. A transformation, T in
Figure 15, maps graph G into graph T(G). Often enough T
cannot be readily expressed in an analytic form.


(b) As a configuration on a ‘tesselation’ (a ‘tesselation’ is an
infinite plane of cells, i, j, each of which can assume one of a
finite number of states u ∈ U_i, u ∈ U_j, amongst which, for each
value of i, at least one state is a null state denoted as u_0 ∈ U_i).
Entry into a null state is interpreted as the disappearance of
some aspect of an automaton, which is a configuration of the
states of certain neighbouring cells, and any transition from the
null state implies the creation of some feature of an automaton.
The transition rule, Z, depends, for the ith cell, upon u ∈ U_i and
the state of neighbouring cells. Hence Z is a mapping which, if i
and j are neighbours, is of the form Z_i: (U_i, U_j) → (U_i).
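The (b) image can be illustrated by a Python sketch of a configuration on an unbounded plane of cells. The transition rule used here is Conway's ‘Life’, a much later and familiar two-state rule which merely stands in for Z; it is not the rule of von Neumann or Loefgren. The sparse-set representation and the function name `step` are assumptions made for the example.

```python
from collections import Counter

def step(live_cells):
    """Apply one transition to a configuration on an unbounded plane.

    `live_cells` is the set of (i, j) cells in the non-null state; every
    other cell is in the null state.
    """
    # Count, for every cell, how many of its eight neighbours are non-null.
    neighbour_counts = Counter(
        (i + di, j + dj)
        for (i, j) in live_cells
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0)
    )
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }

blinker = {(0, -1), (0, 0), (0, 1)}
after_one = step(blinker)
after_two = step(after_one)
```

The new state of each cell depends only upon its own state and the states of its neighbours. An isolated non-null cell returns to the null state, whereas the three-cell configuration persists as an oscillation of period two.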


If an automaton is localized its (a) image is a structure of 
functionally defined components and its (b) image is a connected
configuration of other-than-null states. If an automaton is un- 
localized, in particular if it ‘evolves’, its (a) image is often diffi- 
cult to interpret because of an ingrained confusion between 
organizations and objects, whereas its (b) image is a replicating 
population of distinct configurations. 


It is worth noticing that: 


(1) In his discussion of ‘error correction’, Loefgren shows 
that in a real universe, where an automaton receives permanent 
structural perturbations as well as irrelevant inputs, no localized 
automaton can have an infinite life span but that some un- 
localized automata can have an infinite life span. This distinction 
demarcates the set of evolutionary learning systems. It is impor- 
tant, however, to avoid confusion between this and the well- 
known distinction of minimal automata and automata that 
are deliberately built with structural redundancy in order to 
effect error correction. 


(2) As emphasized in 1.2, the simulation of a learning 
machine entails the simulation of its normal environment. If 
learning is interpreted as the reproduction of an organization 
(which is, for example, the view adopted by Wiener) the environ- 
ment can be said to mediate the reproductive process. In fact, 
this interpretation (or some equivalent) is necessary in order to 
avoid Rosen’s®*! paradox (which is a special case of Russell’s
paradox), but the term ‘environment’ implies an internal ‘envi- 
ronment’ such as a brain or some other physical embodiment in 
which the evolutionary process can be realized. It can be called, 
unspecifically, the medium in which evolution takes place. 

Because of the confusion between organizations and objects 
it is often difficult to specify the medium in the (a) image, but 
the (b) image shows a population in (or an organization that is
a property of) the most abstract possible representation of a 
medium, namely a tesselation. This is a straightforward picture 
and, for most purposes, is preferable. 

Discussion of abstract systems that learn, in the evolutionary 
sense, is always related to the dual pictures of (a) and (b) but
some contributors to this field, such as Masano Toda*®, Heinz 
von Foerster®!, Caianiello®®, and Pask®*; ®, have been differently 
motivated and oriented. A far less abstract version of a medium 
is used, akin to a close coupled, but real environment. However, 
it remains true that the evolving organization is a property of 
the medium rather than something that exists in its own right 
and a couple of points are worth emphasis. In the first place, 
the evolutionary rule, Z, is a property of the other than minimal 
specification of the medium thus illustrating the material
dependence cited in 1.5. Next, the medium in which learning
systems can evolve is highly constrained, but the constraints
do not refer directly to the computations manifest in the learning 
behaviour. They exist in order to ensure that evolution of any 
system can take place. 


2.7. Different Realizations of Learning Machines in Evolving 
Systems 

Selfridge* has devised an hierarchically structured pro- 
gramme called ‘Pandemonium’ that can be used either for re- 
cognition or for control. The ‘Demons’ in a Pandemonium are 
computing routines and are isomorphic with possibly adaptive 
subcontrollers, as shown in Figure 16. The lowest order demons, 
in contact with the environment, recognize or control prede- 
termined features, indicating their achievement by signals con- 
veyed to higher order demons that assign weights to these sig- 
nals and transmit a linearly weighted evaluation of the lowest 
demons’ performance to an overall controller demon which is
criticized by some external mentor.


Figure 16


The system resembles a Perceptron in so far as the middle 
demons perform a linear weighting and in so far as there is some 
external mentor. It differs a great deal in respect to the lowest 
demons, which in Pandemonium are highly specific feature re- 
cognizers or subcontrollers. A more important difference lies, 
however, in the development of the system, which is not revealed 
by the instantaneous picture. Whereas adaptation in a perceptron 
entails only an adjustment of weights, the demons in a Pande- 
monium must be capable of evolution. The designer is supposed 
to be prudent in his choice of features, but not omniscient. Con- 
sequently, subcontrollers that meet with common disapproval, 
evidenced by uniformly low weight assignments, must be discard- 
ed and others created to take their place. If a subcontroller in 
the organization of a Pandemonium is identified with a species 
of automata, the organization could be realized by a fairly simple 
evolutionary process. Members of a population of automata 
compete for some commodity (food or money) that is available 
in short supply and which allows them to persist and to repro- 
duce (this tacitly assumes a cost, in terms of food or money, 
for maintaining the fabric of an automaton and a similar cost 
for its replication). The supply of food or money is sufficient to 
maintain, on average, a number, m, of automata, which (depending 
somewhat upon the distribution of automata in different 
species) can sustain the activity of the required number of sub- 
controllers. Since more than m automata are created there is 
competition and some automata fail to survive (indeed, those 
that survive will be members of species favoured by the higher 
order components that act, presumably, upon evolutionary 
rules). 
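
The discard-and-replace cycle described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Selfridge's programme: each lowest order demon merely reports one input component, the weighting demon uses a batch delta rule, and replacements are drawn in turn from a fixed supply of candidates. The names `Pandemonium`, `train` and `evolve` are hypothetical.

```python
from itertools import cycle, product


class Pandemonium:
    """Toy two-level Pandemonium: feature demons feeding one weighting demon."""

    def __init__(self, n_inputs, n_demons, rate=0.5):
        # Assumed supply of candidate demons: demon k reports input component k.
        # n_demons must be smaller than n_inputs so a replacement always exists.
        self.candidates = cycle(range(n_inputs))
        self.features = [next(self.candidates) for _ in range(n_demons)]
        self.weights = [0.0] * n_demons
        self.rate = rate

    def evaluate(self, x):
        # Linearly weighted evaluation passed to the overall controller demon.
        return sum(w * x[f] for w, f in zip(self.weights, self.features))

    def train(self, data, steps=10):
        # The external mentor supplies the target; the weights follow a batch
        # delta rule (an assumption of this sketch, not taken from the source).
        for _ in range(steps):
            grads = [0.0] * len(self.weights)
            for x, target in data:
                error = target - self.evaluate(x)
                for k, f in enumerate(self.features):
                    grads[k] += error * x[f]
            for k, g in enumerate(grads):
                self.weights[k] += self.rate * g / len(data)

    def evolve(self, threshold=0.1):
        # Discard the least-weighted demon if its weight is negligible and
        # create a replacement from the supply of candidates.
        k = min(range(len(self.weights)), key=lambda i: abs(self.weights[i]))
        if abs(self.weights[k]) < threshold:
            new = next(self.candidates)
            while new in self.features:
                new = next(self.candidates)
            self.features[k] = new
            self.weights[k] = 0.0


# Inputs are +/-1 vectors; the mentor rewards tracking of component 4.
data = [(x, x[4]) for x in product([-1.0, 1.0], repeat=6)]
system = Pandemonium(n_inputs=6, n_demons=2)
for _ in range(6):
    system.train(data)
    system.evolve()
```

The initial demons (components 0 and 1) are useless for this mentor, receive negligible weight and are replaced in turn; once a demon for component 4 is created it acquires a weight near unity and persists, whilst the other position continues to be turned over.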

Given some kind of variation, there will be a selection of 
favoured variants. The issue is, what principle of selection is to 
be used? Purely chance variation would be pointless, for the set 
of possibilities is too large, but two mechanisms are possible, 
namely: 

(1) Recombination. Instead of discarding an unpopular de- 
mon and creating a novel demon, the elements of a partially 
successful demon are recombined in a fresh arrangement. Hence, 
amongst other things, the adaptations of the original demon are 
not completely lost. 

(2) Co-operative interaction whereby the original demons are 
coupled into co-operative entities which are reproduced, by some 
different and possibly evolved mechanism, as a whole. 

Although a model of this kind is easy to describe and parts 
of it have been successfully simulated, there is a tendency to 
consider media more like brains than the environments of simple 
automata. Pringle®* suggested that hierarchies of control could 
evolve in a large multiple mode oscillator due to the non-linear 
coupling between stable oscillatory modes. Such a system is 
conveniently realized by a large network of self-exciting thresh- 
old elements, with malleable coupling. The existence of a stable 
mode imprints a pattern upon this medium which, in turn, in- 
duces the stable mode or filters its components from uncorrelated 
excitation (in a stable system there will be a many to many cor- 
respondence between patterns and dynamic components). The 
existence of a stable mode acts as an informational ‘metabolism’ 
that maintains the constraints which are required in order that 
the network shall compute a specified function of its external 
stimulation (hence modes of oscillation are, in this case, the sub- 
controllers or demons). Networks like Beurle’s simulation’: 1°, 
mentioned in 1.2 are admirable media for this purpose and 
Beurle has examined the self-excited as well as transmissive 
characteristics of this system. Amongst others, Farley and 
Clark®* have simulated very large assemblages of somewhat 
simplified non-linear elements. Taylor’s paper, at the present 
meeting, deals with this area. 

From a theoretical viewpoint, Wiener* has recently propos- 
ed a potentially quantitative theory of learning and has tentatively 
applied it to data from cortical activity. As a generalization of 
our previous comments about oscillatory modes, replication 
(which is viewed as the basic mechanism of learning) is re- 
presented by the distortion free propagation of a wave of activity 
in a medium of non-linear elements. (The comments on filtering 
are generalized in terms of Wiener filters.) Further development 
of this work should lead to a precise and comprehensive image 
of the evolutionary mechanism that underlies learning®. 


2.8. Problem Solving, Communication and Strategy in Linguistic 
Environments 

As suggested in 1.5, there are many situations in which 
the control engineer is required to build learning machines 
that interact with a symbolic environment. Often these situations 
are obvious enough; for example, the related issues of language 
translation and data retrieval certainly entail learning (at any 
rate in large systems) and it seems unlikely that the learning of 
sign sequence correlations is sufficient (the machine must learn 
the language concerned). The same is true for the whole of 
‘artificial intelligence’. Other situations are not so obviously 
symbolic in character. Tustin®® has demonstrated that many 
facets of the economy may be predicted and perhaps controlled 
within a model that is derived from Laplace transformations of 
economic data. As a result, the design of a control mechanism 
appears to involve an application of conventional methods. 
However, there are discontinuities, due to invention, for ex- 
ample, which prove intractable within the framework of Tustin’s 
model. One can, of course, neglect these or, failing that, one 
can interpolate a guess, but it seems at least possible that a 
machine capable of symbolic interaction might guess solutions 
to these problems, better than oneself, but by using one’s own 
methods which are (by hypothesis) to discern relevant signs in 
an underlying cultural language and to interpret them as indices 
of change (this, incidentally, will involve feedback perturbations 
of the controlled system and a machine capability for learning 
as well as manipulating the language). 


Another situation involving symbolic interaction is man 
machine symbiosis, for example, in problem solving (the term 
‘symbiosis’ is used to distinguish the present relation between the 
man and the machine from the more familiar relation in which 
a machine is used as a problem solving tool). In symbiosis, the 
aim is man machine co-operation. The machine should sug- 
gest hypotheses and proofs to the man, as well as carrying out 
his instructions. In order to do this, it must develop a common 
language (so that the man machine interaction has the logical 
calibre of a conversation). As a result, in a genuinely symbiotic 
system, it would be impossible to tell whether the man or the 
machine was responsible for the solutions achieved. 


Very little work has been done, so far, in symbiotic problem 
solving but some results are available for comparable symbiotic 
systems used for training (the control of learning) and mental 
testing (stabilizing the man machine interaction to maximize 
the information available regarding the values of specified para- 
meters). 

A few specific mechanisms are now discussed in an attempt 
to illustrate some of the chief difficulties of the field. 


In order to talk about signs, it is first necessary to specify an 
object language L_0 (which is used by the system concerned) in 
terms of an observer’s metalanguage L*. As Cherry®’ has pointed 
out, this expedient is a formal prerequisite for any kind of ex- 
perimentation (but it is tacitly assumed and taken for granted 
when the experimental technique is well known and the L* inter- 
pretation of L_0 signs is invariant). If one also wishes to assert 
that an organism or a machine manipulates signs (in particular, 
that it learns about signs or acquires concepts) one must either: 


(a) Be in a position to embed symbolic representations with- 
in the machine (trivially by building them into it and not entirely 
trivially by the ham-fisted reinforcement of an adaptive network 
so that it recognizes the symbols determined by its external 
mentor), or 

(b) Understand a common environment (by knowing ‘essen- 
tial variables’, for example, so that, in the absence of rein- 
forcement it can be taken for granted that the organism or 
machine will learn to use signs denoting the critical value of 
these ‘essential variables’ if it survives. Failing this, it would 
encounter the critical condition rather than a sign for it). 


Returning again to 1.5, the fact that not all abstrac- 
tions can be generalized is tantamount to the fact that not all 
signs act as symbols for a given machine and its environment. 
To cite a picturesque but defensible analogy, the population of 
automata in an evolutionary learning system must (if they inter- 
act co-operatively as in 2.7) possess an ‘internal’ language 
with symbols of its own. Generalization is possible if this ‘inter- 
nal’ language can be translated into the external language and 
vice versa. 

Having established L_0 and represented it in L*, a control 
mechanism can be regarded as a problem solving device. The 
perturbations to which this controller responds correctively pose 
problems (certain equivalence classes of x ∈ X denote prob- 
lems and certain y ∈ Y that index the controller’s asymptotic 
response denote solutions). The controller becomes a problem 
solver (an adaptive controller becomes able to solve a particular 
problem) and the functions computed by the fully adapted 
machine specify a decision rule. A largely equivalent formulation 
images the machine as a game player when certain y ∈ Y are its 
moves, certain x ∈ X are the moves of another player, 8 (x, y) 
is the payoff function of the game over the outcome set X × Y, 
and the computed function determines strategies, # (as describ- 
ed, the game is played in its extensive form). 
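
As a minimal sketch of this formulation, assume finite sets of perturbations and responses and a payoff defined on pairs (x, y); the decision rule of a fully adapted controller is then the map that sends each perturbation to the response of greatest payoff. The function `decision_rule` and the example payoff are hypothetical and chosen only for illustration.

```python
def decision_rule(payoff, X, Y):
    """Map every perturbation x to the response y with the greatest payoff."""
    return {x: max(Y, key=lambda y: payoff(x, y)) for x in X}


# Assumed example: the controller is paid for cancelling the perturbation.
X = [-1, 0, 1]          # moves of the environment (problems)
Y = [-1, 0, 1]          # moves of the controller (candidate solutions)


def corrective_payoff(x, y):
    return -abs(x + y)


rule = decision_rule(corrective_payoff, X, Y)
```

Here `rule` assigns to each perturbation its opposite, which is the corrective response one would expect of a simple regulator.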

An abstractive or hierarchical learning machine is capable of 
quite elaborate problem solving and game playing since it can 
learn to solve a class of problems by constructing descriptions 
of problem solution. An hierarchical structure can embody as 
many machine metalanguages L_1, L_2, ... (describing and manip- 
ulating L_0) as there are levels of control, and evolutionary sys- 
tems are able to build L_r, r = 1, 2, ... (which are, in some sense, 
interpretable as conceptual languages). 

The issue of 1.5 reappears, however, namely ‘in what sense 
are these levels of control conceptual levels?’ Providing there 
are no overtones of mentation, any abstractive learning machine 
can have and some can build ‘concepts’, but in order to 
remove the inverted commas, it is necessary to consider the dis- 
course between this machine and other machines or other men. 

A pair or more of learning machines can interact at several 
levels of discourse, L_0, L_1, ... either through couplings establish- 
ed between different levels of control or, due to similarity of 
construction, by inferring the L_1, L_2, ... implication of expres- 
sions in L_0. George®: © has experimented with multi-machine 
computer simulations and has examined the linguistic require- 
ments entailed in the successful conduct of non-zero sum, partial- 
ly co-operative, games and certain sequential games. The entire 
experiment is explicated, of course, in L*. 

On the other hand, in connection with ‘artificial intelligence’, 
it is not only necessary to describe the problem solving in terms 
of L*, but also to interact, personally, with the machine. 

A typical man machine system is proposed by Edwards”. 
The job of constructing a Bayes estimate of the probability of 
an hypothesis, conditional upon certain data, 

$$p(\mathrm{Hyp} \mid \mathrm{Data}) = \frac{p(\mathrm{Data} \mid \mathrm{Hyp}) \cdot p(\mathrm{Hyp})}{p(\mathrm{Data})}$$

is ideally divided between a man and a machine. There is evi- 
dence to suggest that a man can be trained to determine 
p(Data | Hyp) accurately, which a machine cannot readily do (especial- 
ly if much of the data is irrelevant or if the hypothesis is tenta- 
tive), providing he receives a guide from present values of 
p(Hyp | Data). On the other hand a man is a notoriously in- 
accurate calculator and in Edwards’s system the Bayes calcu- 
lation is conducted by a machine. 
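
The division of labour can be illustrated by a short sketch in which the man's judgements of p(Data | Hyp) are simply entered as numbers and the machine carries out the arithmetic. It is assumed here that the hypotheses are mutually exclusive and exhaustive, so that p(Data) is the sum of p(Data | Hyp) p(Hyp) over the hypotheses; the function name `posterior` and the figures are hypothetical.

```python
def posterior(prior, likelihood):
    """Return p(Hyp | Data) for every hypothesis by Bayes' rule."""
    joint = {h: likelihood[h] * prior[h] for h in prior}
    p_data = sum(joint.values())  # p(Data), by total probability
    if p_data == 0:
        raise ValueError("the data are impossible under every hypothesis")
    return {h: joint[h] / p_data for h in joint}


# The man supplies p(Data | Hyp) for each datum; the machine revises the
# probabilities and displays them as a guide for his next judgement.
belief = {"H1": 0.5, "H2": 0.5}
human_judgements = [
    {"H1": 0.8, "H2": 0.2},
    {"H1": 0.6, "H2": 0.3},
]
for likelihood in human_judgements:
    belief = posterior(belief, likelihood)
```

After the two judgements the probability of H1 has risen from 1/2 to 8/9, each posterior serving as the prior for the next datum.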

The extreme case of man machine interaction, conversational 
man machine symbiosis, is exemplified by the adaptive teaching 
system in Figure 17 (these systems have been used to instruct a 
number of industrial skills including perceptual motor skills, 
like card punching and intellectual skills like fault detection on 
the basis of imperfect evidence)?!~"4. 

Figure 17 (panel labels: ‘Machine’; ‘Equivalent to’) 

Broadly, the machine acts like a personal instructor who 
changes the training routine to suit an image of the student, 
derived from continual observation, by posing novel problems 
when the student is bored and simplifying the material or co- 
operating if the displayed data is too difficult. Precisely, the 
system is a stochastic optimizer based upon the model given 
in 2.6. The machine selects subenvironments, characterized 
by the different matrices /7,, with an operation 2 analogous to 
the operation / (in the ‘student’ part of the model). Adjoining 
the plausible condition that each of the adaptive ‘automata’ 
(that momentarily represent the student) must receive intelligible 
problems to solve, and also interpreting θ_r as an index of 
relevant regularity derived from a success measure in such a way 
that Δθ_r is an index of learning rate in the rth subenvironment 
(or for the rth subskill) it is possible to specify a strategy. 

Now Figure 17 represents a feasible system if there is enough 
information about L_0 to determine a ‘structured skill’ (in partic- 
ular to reduce the skill to sets of problems x* ∈ X_r* ⊂ X_r con- 
cerned with the rth subskill wherein, for each r, there is a ‘sim- 
plifying’ procedure that renders any x* ∈ X_r* a more intelligible 
simplified problem, denoted by a stimulus x ∈ X_r − X_r*). If so, it 
can be shown that a pair of strategies C_1 and C_2 will maximize 
the rate of directed self-organization in the system J* of the 
model of 2.6, namely: 

C_1: Given that X_p is selected, hence J_p, vary the simplification 
of an arbitrary sequence of x selected from X_p to minimize the 
degree of simplification and to maximize Δθ_p. If this proves 
impossible in J_p inform A and institute C_2. 

C_2: Let A select X_r, hence J_r, such that the expected value of 
Δθ is maximized. Proceed with C_1. 
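
One possible reading of the two strategies is sketched below. The sketch rests on several assumptions that are not in the text: simplification is represented by a single number between 0 (unsimplified) and 1 (fully simplified), a positive learning-rate index is taken as licence to withdraw simplification, and the first strategy is deemed impossible when the problems are fully simplified and the index is still not positive. The function names `c1_step` and `c2_select` are hypothetical.

```python
def c1_step(simplification, delta_theta, step=0.1):
    """One move of the first strategy inside the selected subskill.

    simplification: 0.0 (unsimplified problems) to 1.0 (fully simplified).
    delta_theta: current estimate of the learning-rate index.
    Returns (new_simplification, stalled).
    """
    if delta_theta > 0:
        # The student is learning: withdraw some of the simplification.
        return max(0.0, simplification - step), False
    if simplification < 1.0:
        # No progress: render the problems more intelligible.
        return min(1.0, simplification + step), False
    # Fully simplified and still no progress: inform A, institute the
    # second strategy.
    return simplification, True


def c2_select(expected_delta_theta):
    """Second strategy: A selects the subskill with the largest expected index."""
    return max(expected_delta_theta, key=expected_delta_theta.get)
```

A controlling loop would call `c1_step` after each problem and, whenever it reports a stall, call `c2_select` with the current expectations and resume the first strategy in the subskill so chosen.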


There is some evidence to indicate that C_1 and C_2 are, in a 
certain sense, ‘optimal’ strategies that maximize learning rate, 
but for most skills a much more elaborate hierarchical adaptive 
mechanism is needed to realize C_1 and C_2. Note that although 
the selections made by strategy C_1 are expressions in L_0, the 
selections made by A, embodying strategy C_2, are expressions in 
a machine metalanguage, L_1 say, since they instruct the student 
to direct his attention to a different subskill”. 

In Figure 18 a similar adaptive device is used to minimize the 
error rate in an inspection skill. The adaptive machine in 
Figure 18 injects ‘false’ signals into the inspector’s display, senses 
whether he misses the signal, and adjusts the injected signal 
distribution to minimize this error rate. All ‘false’ signals are 
automatically excluded from the system output. 
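
The inspection device may be sketched as follows. The adjustment rule is an assumption of the sketch: ‘false’ signals of each kind are injected in proportion to the estimated rate at which the inspector misses that kind, so that practice is concentrated where his errors lie. The class `InspectionTrainer` and the function `system_output` are hypothetical names.

```python
class InspectionTrainer:
    """Keeps a miss-rate estimate for each kind of injected signal."""

    def __init__(self, kinds):
        # Start each kind with one assumed miss in two assumed showings,
        # so that no kind ever receives zero probability.
        self.missed = {k: 1 for k in kinds}
        self.shown = {k: 2 for k in kinds}

    def record(self, kind, detected):
        # Sense whether the inspector caught an injected signal.
        self.shown[kind] += 1
        if not detected:
            self.missed[kind] += 1

    def distribution(self):
        # Inject each kind in proportion to its estimated miss rate.
        rates = {k: self.missed[k] / self.shown[k] for k in self.shown}
        total = sum(rates.values())
        return {k: r / total for k, r in rates.items()}


def system_output(items):
    """Exclude injected signals; items are (payload, injected) pairs."""
    return [payload for payload, injected in items if not injected]


trainer = InspectionTrainer(["crack", "dent"])
for _ in range(3):
    trainer.record("crack", detected=False)
    trainer.record("dent", detected=True)
mix = trainer.distribution()
```

With these records the machine injects cracks four times as often as dents, and `system_output` passes on only the genuine items.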

These machines have the status of catalysts that control an 
evolutionary process in the student’s brain. They could be ex- 
tended (by providing a capability for building an hierarchy 
L_1, L_2, ... of metalanguages) so that this process became increas- 
ingly exteriorized (the machine could act as a medium, compa- 
rable to the brain, in so far as it manipulated ‘concepts’ like the 
student, but due to co-operation there is a possibility of a man 
machine compromise about the form of ‘concept’). 

If the man is removed from this extended system, it becomes 
an undiluted ‘artificial intelligence’. The ‘induction principles’ 
demanded by Church’*, Markoff”, and others in an intelligent 
system must stem from the evolutionary rules that constrain the 
machine. Given that these are specified rightly, the ‘artificial 
intelligence’ will solve problems as man solves them. One method 
of embedding these rules might be to allow a learning machine 
to mature in contact with a group of men and a common lin- 
guistic environment. The other possibility is to apply constraints 
upon the machine which may be 

(a) Heuristics determining classes of solution or classes of 
algorithms. 

(b) Cost functions, determining how much the machine has 
to ‘pay’ in terms of a necessary and limited commodity, to main- 
tain different structures or to apply different algorithms. 

Some features of the better-known intelligent learning ma- 
chines are now briefly outlined: 


(1) One of the less elaborate systems is Manfred Kochen’s” 
problem solver which applies various search strategies in a situ- 
ation proposed by Bruner, Goodnow and Austin” in their 
studies of problem solving and learning. Its behaviour compares 
reasonably with human behaviour in the restricted experimental 
environment of these studies but the system is not intended to 
do more than this. This learning machine is an adaptive device 
that modifies the parameter, θ, of the function set defined by its 
strategies. 


(2) The hierarchy of control (in a symbolic identification of 
languages) is isomorphic with the structure of a list programme. 
Hence problem solving machines like Newell, Shaw, and 
Simon’s G.P.S.8°-®? are abstractive learning machines. Variation 
of the hierarchical structure takes place as a function of the 
experience gained by applying the listed tests. Learning has at 
least a couple of different forms. (a) Learning to apply a test 
which entails the development of a local metric over the choice 
tree (equivalent to adaptation at a given level as in Feigenbaum 
and Samuel’s device™®). (b) Learning to order tests and modify 
lists, which amounts to changing the hierarchical structure. 


Minsky**: §4 has particularly emphasized the role of heuristics 
or broad principles of problem solving that are built into a 
machine and guide it in selecting the algorithm it applies to its 
working data and to its own structure. Unless these are built in- 
to the device there seems no hope that any sort of order will ap- 
pear. The heuristic constitutes an assertion of how a species of 
learning machine will look at its environment. 


(3) Together with Minsky, Marzocco, Travis*®, and others 
at S.D.C. take the very reasonable view that the most important 
heuristic determines a principle of restricted generalization 
whereby a given machine learns to apply a ‘generalized’ form 
of a previously successful algorithm when it is subsequently pre- 
sented with a ‘similar’ problem. The crux of the learning process 
is to assign a connotation to the words in inverted commas. 

This is also true of Amarel’s®® ‘theorem proving’ machine 
that ‘learns to learn’ in a limited domain (it proves ‘theorems’ 
that are mappings from a finite product set of elements into one 
coordinate of this set). Even here the search process (which 
entails developing choice trees and building up lists of descrip- 
tions of these procedures) is impracticable unless it is somehow 
restricted. Amarel urges (a) the need for a variety of descriptive 
metalanguages L_0, L_1, ... in any intelligent learning machine, (b) 
the need for some kind of economic control over the search 
process. Apart from obvious measures like informational value 
that can be used to characterize the result of a test, the machine 
must have to pay for making tests and maintaining the test 
systems in terms of a restricted commodity like the food or 
money discussed in 2.7. A similar point is made by Solomonoff®’. 
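
The idea of economic control can be made concrete by a small sketch in which every test has a cost in the restricted commodity and an estimated informational value, and the machine applies tests only so long as it can pay for them. The greedy rule used here (best value per unit cost first) is an assumption adopted for brevity; it is not the scheme of Amarel or of Solomonoff, and `run_tests` is a hypothetical name.

```python
def run_tests(tests, budget):
    """Choose tests greedily by value per unit cost within a fixed budget.

    tests: list of (name, cost, value) with cost > 0.
    Returns (names of the tests applied, commodity remaining).
    """
    chosen = []
    remaining = budget
    ranked = sorted(tests, key=lambda t: t[2] / t[1], reverse=True)
    for name, cost, _value in ranked:
        if cost <= remaining:
            # The machine pays for the test before applying it.
            chosen.append(name)
            remaining -= cost
    return chosen, remaining


applied, left = run_tests([("a", 4, 8), ("b", 3, 3), ("c", 2, 5)], budget=6)
```

In this instance the machine applies test c and then test a, exhausting its budget, and test b is never made.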


(4) Maron?’ points out that the basic problem of artificial 
intelligence is choice. The logical rules that are built into a 
machine and the probabilistic bias it may acquire by learning 
determine only what it may not do (the set of alternatives it is 
allowed to contemplate and those it may not contemplate). Noth- 
ing is said about what it should do within the contemplated set 
(an instruction to ‘throw a dice’ indicates our indifference but 
the act is realized by a specific and often inept ‘decision proce- 
dure’). A machine that genuinely learns to solve problems must 
learn to choose and is likely to develop preferences and oddities. 


3. The Mechanism in which Learning Machines are Realized 


3.1. Brain Analogies 


(1) The brain is probably the best immediately available guide 
to how the various constraints should be realized by the control 
engineer. This realization is the concern of Bionics or Biological 
Cybernetics. Many of the hoary contentions of this field, like 
how ‘random’ or how ‘specific’ a brain may be, are losing their 
edge. Nowadays, few people subscribe to the doctrine of ‘ran- 
dom’ networks and few people imagine that the brain is a large 
‘telephone’ system. There remain a number of less extremist 
biologically oriented design principles for learning machines that 
appear to have more than polemic value. These must be applied 
with the recognition that other than biological brains may have 
entirely different optimal designs. 


(2) Brains do (and one suspects that learning machines must) 
have a well defined overall plan. The L_0 informational interface* 
between the organism and its environment is characterized, at 
the input, by an hierarchically arranged filter that extracts per- 
ceptual attributes. At the output, there are hierarchically con- 
trolled motor repertoires that give rise to sequences of coherent 
actions such as ‘speech’ or ‘eating’ each of which can be regarded 
as a whole. Between these specialized regions, there are other 
more or less organized entities concerned with integration and 
various forms of control. Probably the most comprehensive 
cybernetic model is provided by Anokhin*’. 


Within such a broad plan the brain has well defined func- 
tional units such as the various distinct memory systems, dis- 
cussed, for example, by Brown*®*® and Barbizet®. 


(3) Very specific filtering networks have been constructed by 
von Foerster® and by Harmon®. This work has been stimulat- 
ed and guided by the physiological investigations of Lettvin, 
Maturana, McCulloch and Pitts®*, of Hubel and Wiesel, 
of Reichardt® and others. Novikoff® has outlined principles for 
designing attribute filters on the basis of integral geometry. An 
axiomatic specification of a pattern is given by Lerner and 
Vashnik?”’. 

At a more macroscopic level, Lerner!*! has proposed reflex 
analysing mechanisms which suggest and have been embodied 
in machine organizations. Finally, it can be argued that there is an 
information theoretical analogy between the filters of a learning 
machine and certain forms of communication and computation 
channels. Agalides®* considers the matter at a microscopic level 
and Broadbent’s®® model (which is interpretable as a psycho- 
logical hypothesis) refers to the macrostructure of a learning 
machine. 


(4) Some of the filters in a brain are adaptive and some are 
predictive. On the basis of this analogy, Barlow and Donald- 
son? have built a highly specific coding system which adapts 
to minimize the redundancy in an input sequence which could 
be plausibly derived from an attribute filter. Uttley’s!! two 
variable conditional probability theory is interpretable in terms 
of a neurological mechanism through an identification between 
eqns (9) and (10) of 2.3 and some physiological process. 


* Notes. The L_1 or L_2 description interface is normally different. 
A man or any learning machine has a definite location with refer- 
ence to a given descriptional language. If this language is not specified 
he or it has an undetermined boundary. Man is bounded in one 
way as a social being and in another way as a food eating creature. 
He is, at the moment, testing this hypothesis and the view that 
dendritic fields in the brain provide the connectivity of a predic- 
tive network. Almost all of the adaptive filters have some claim 
to biological precedent. Thus Rosenblatt’s Perceptron was in- 
spired by Hebb’s learning theory and assemblies of the kind that 
Taylor and Steinbuch have built are consonant either with 
Hebb’s view! 2 or more specific theories such as those of Milner’. 

In brains, the hierarchical structure of control is partly 
apparent in the anatomy. Bishop!” points out that the Broad 
Pattern (which is anatomically defined) has evolutionary origins. 

Braynes, Napalkov and Sechvinsky!®, and Napalkov®”, 
have developed a particularly elegant description of conditioned 
reflexes in terms of algorithms. In many animals it is possible 
to exhibit the hierarchy of control as an interaction between 
different levels of algorithms which can, of course, be rigorously 
specified as a relation between normal algorithms in the sense 
of Markoff. The hierarchy of algorithms describes a structure 
of reflexive chains which have, at any moment, a definite loca- 
tion, but there is no reason why the description should always 
refer to the same physical structure. Hence the hierarchy of 
algorithms is an organization which sometimes may and some- 
times may not reflect an anatomical hierarchy. These authors 
have built several machines to illustrate their own ideas and the 
obviously related concepts of the response hierarchy biologists 
such as Tinbergen!®’. One of the most comprehensive hier- 
archical learning systems is described by Andreae*". 

(5) The brain is a parallel computing mechanism and it has 
a great deal of structural and functional redundancy. Work such 
as Cowan’s!’8 on error correction in computing networks leads 
to an image (the design of a typical error correcting, stable, 
network) which is also highly redundant and which closely re- 
sembles the connectivity of a brain. In this kind of system, the 
same function may be mediated by many different components 
in many different locations. The brain is many physical realiza- 
tions of a computing system rather than a single realization (a 
brain is functionally similar to a computing system—it bears 
little resemblance to the object called a computer). The minimal 
components are not always well defined and the attributes of 
physical events that count as signals are rarely well defined (thus 
in some parts of a brain, neurone groups act as the minimal sub- 
systems, in other parts, single neurones, and in others the com- 
ponents of a neurone. Similarly, ‘signals’ may be impulses or 
the phase relations between impulses or the mean rate of im- 
pulses, or analogue variables that are not directly related to im- 
pulses). It is not unreasonable to specify this kind of micro- 
scopic redundancy of function in the components of any learn- 
ing machine. Certain obvious advantages are obtained. The 
trouble is that the idea of a ‘component’ and the familiar dis- 
tinction between signals and objects that modify these signals is 
lost. 
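
The advantage of mediating one function by many components can be shown by a very small sketch. It uses plain majority voting, which is far cruder than the error correcting networks referred to above and is offered only as an illustration; the component functions and the name `redundant_compute` are hypothetical.

```python
from collections import Counter


def redundant_compute(components, x):
    """Return the majority verdict of many components computing one function."""
    votes = Counter(component(x) for component in components)
    return votes.most_common(1)[0][0]


def good(x):
    # The intended function: first input present and second input absent.
    return x[0] and not x[1]


def stuck_at_true(x):
    # A hypothetical faulty component that always fires.
    return True


network = [good, good, stuck_at_true, good, stuck_at_true]
```

Although two of the five components are faulty, the network as a whole still computes the intended function, since the three sound components outvote them on every input.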

Finally, there is a redundancy at the level of macroscopic 
properties. It is probably pointless to ask what in a brain medi- 
ates its memory. An indefinitely large number of physical mech- 
anisms might (and, in the sense of being history dependent, must) 
act in this capacity. The real issue is to discover which are re- 
levant to the computations that interest us. Thus a brain can 
have memory in terms of excited loops of neurones, in terms of 
a distribution of facilitation that favours the propagation of a 
particular wave of activity, in terms of specific macromolecular 
changes at the synaptic interface and possibly in terms of local 
and specific changes in messenger R.N.A. There is no reason to 


suppose that any one of these is responsible for any facet of 
memory and their relative importance probably depends upon 
the state of the brain. Given the component specification sug- 
gested a moment ago, similar properties may be expected in a 
learning machine. 

(6) The brain includes an attention mechanism, chiefly em- 
bodied in the reticular formation which, according to McCul- 
loch and Kilmer! can be regarded as an input and output 
transforming system with parameters controlled by the higher 
centres. It determines the level in an hierarchy of control at 
which a feedback loop, involving the data occupying the brain’s 
attention, will be completed. Ideally this is the level of abstrac- 
tion suited to a given type of data. It has already been argued, 
on entirely different grounds, that an attention mechanism is 
needed in any competent learning machine. 

(7) The fact that a brain is a collection of closely coupled 
evolutionary fragments, does not appear to be accidental, nor 
is the fact that the ontogeny of a brain recapitulates the phylo- 
geny of the brain of a species as the system undergoes the process 
of maturation. Such manifestations as imprinting (that certain 
behaviours are acquired by the organism if and only if a specific 
state of its environment occurs at a specific point in its develop- 
ment) are a predictable consequence of maturation and unless 
an alternative is suggested, any realistic learning machine must 
also undergo a process of maturation as proposed in 2.8 in its 
normal environment. 

In this connection, Young’s investigation!** of the octopus 
has recently revealed that the brain of this animal is an hier- 
archically organized homeostatic system in which the hierarchi- 
cal levels are related to the sensory and motor systems especially 
characteristic of distinct evolutionary stages (touch, visual dis- 
tance receptors, and visual pattern systems). The octopus is 
peculiar in that the functions of its brain are localized (and, 
hence, are readily discriminated) whereas in most animals the 
functions are distributed in a way which obscures the overall 
plan. However, one may hypothesize that the organization ex- 
posed by Young’s work persists, even when its embodiment is 
hard to discern as it is in a mammalian brain. 

(8) There are regions in a brain that could, very readily, act 
as the medium in which dynamic organizations evolve. Relatively 
undifferentiated parts of the cerebral cortex display an activity 
like Beurle’s networks and although it is possible that evolu- 
tionary processes, in the sense given in 2.8, occur in almost 
any biological system, it is tempting to suppose that these parts 
are specially developed as a ‘medium’. There is plenty of evidence 
that evolutionary processes, in the sense given in 2.8, go 
on in brains, that they underlie the learning behaviour of the 
associated organism and serve to extrapolate its biological evo- 
lution into the field of language and social interaction. It has 
been argued, in less flowery terms, that much the same comments 
may apply to learning machines. 


3.2. Non-biological Embodiment 


Since useful learning machines are certainly large, their com- 
ponents (with the reservation of (5) above upon using this word) 
must be small. There seems no reason why one or another of 
the microminiaturization techniques (such as the techniques dis- 
cussed by Shoulders) cannot be used to fabricate the perceptual 
filters and motor hierarchies of (2), (3) and (4) above in conven- 
tional circuitry. The fabrication need not be perfect, since the 
system is stable against crass structural disturbances. These 
methods are not able to realize a competent version of matura- 
tion [required by (7) above] or of a medium [as in (8)]. So far 
as maturation is concerned, the constructing device for etching 
or embedding the circuit would have to be controlled as a func- 
tion of the interaction between the brain it is producing and the 
normal environment of this brain and although this is not impos- 
sible, the expedient seems, at first sight, to be unduly cumber- 
some. 

Various attempts have been made, chiefly by MacKay and 
in my own laboratory, to ‘grow’ connectivity in a network using 
metallic threads or dendrites that develop by electrodeposition 
in a form determined by the activity induced through the 
connections they instantaneously determine. Whereas MacKay 
has brought to competence a method of ‘growing’ dendrites that 
act as connective elements, our own work yielded a very crude 
but somewhat wider application of this process. Briefly, it was 
possible to construct, from dendrites and capacitors alone, a 
realized network of Crane’s ‘neuristors’ (a ‘neuristor’ is 
an active transmission line, of which a nerve fibre is a special 
case. Crane has shown that suitably coupled networks of 
neuristors are able to compute any Boolean or probabilistic 
function. Their realization, using Lilley’s passivated wires, is 
considered by Stewart). In the present system, unstable dendrites 
subserved the non-linear switching action of a neuristor whereas 
stable connective dendrites acted as the coupling and charging 
impedances. If the system is chosen so that the deposition is 
partly reversible, the dendrites have a reproductive character- 
istic and competitive and co-operative interaction takes place 
between members of a population of developing dendrites. 
Since, as Feldman points out, a neuristor system is ideally 
suited to act as a medium for evolving dynamic organizations, the 
dendrite realization of a neuristor network appears particularly 
attractive and it may be possible to use this system now that 
more elegant procedures have been developed. 
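
The competitive interaction described above may be illustrated by a 
toy numerical sketch. It is not a model of the electrochemistry and 
it does not reproduce the apparatus. It assumes, purely for 
illustration, that several threads lie in parallel across a constant 
current source, that the current divides in proportion to thread 
conductance, that deposition increases faster than linearly with the 
current a thread carries, and that dissolution is proportional to the 
thread's present conductance (the 'partly reversible' case). The 
function name, the parameters and the exponent are all hypothetical. 

```python
def grow_threads(conductance, total_current=1.0, deposit=0.5,
                 dissolve=0.1, exponent=2.0, steps=200):
    """Toy model of competing dendritic threads (illustrative only).

    Assumptions, none of which come from the original text:
      - the threads are in parallel, so a fixed total current divides
        in proportion to each thread's conductance;
      - deposition is proportional to (thread current) ** exponent;
      - dissolution is proportional to the thread's own conductance.
    """
    g = list(conductance)
    for _ in range(steps):
        total = sum(g)
        # Share of the fixed current carried by each thread.
        currents = [total_current * gi / total for gi in g]
        # Growth by deposition, loss by partial dissolution.
        g = [gi + deposit * (ci ** exponent) - dissolve * gi
             for gi, ci in zip(g, currents)]
    return g


if __name__ == "__main__":
    # Three nearly equal threads; the middle one starts slightly ahead.
    final = grow_threads([1.0, 1.1, 0.9])
    for index, value in enumerate(final):
        print(index, value)
```

Under these assumptions the thread that begins with a small advantage 
captures an increasing share of the current and the others dissolve. 
With an exponent of 1.0 the ratios between threads are preserved, 
so the competition in this sketch depends entirely upon the assumed 
non-linearity. 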

The basic criticism is that the size is too large. Ideally, the 
active transmission lines should be large macromolecules and 
Bowman argues that certain polymers could be used for this 
purpose. To complete the specification, it is necessary to transmit 
signals along polymer ‘neuristors’ whilst the polymerization is 
in progress, and to use these signals through a local catalytic 
action to control the further extension of the ‘neuristor’ network. 
Physical chemists seem willing to countenance the possibility of 
controlled catalysis, at least in principle. 